Welcome back, everybody. I'm sure you've noticed there's been a lot of cool activity on the forum this week, and one of the things that's been really great to see is that a lot of you have started creating really helpful materials, both for your classmates to better understand things, and for you to understand them better yourself by trying to teach what you've learned. I just want to highlight a few. I've actually posted a few of these to the wiki thread, but there are lots more. Reshama has posted a whole bunch of nice introductory tutorials. For example, if you're having any trouble getting connected with AWS, she's got a whole step-by-step guide to logging in and getting everything working, which I think is terrific. It's the kind of thing where, if you're writing notes to remind yourself how to do something, you may as well post them for others as well, using a markdown file like this. It's also good practice if you haven't used GitHub before: if you put it up on GitHub, everybody can use it, or of course you can just put it in the forum. A more advanced thing Reshama wrote up: she noticed that I like using tmux, which is a handy little tool that basically lets me keep my windows around. Let's see if I've got one I can show you. As soon as I log into my computer, if I run tmux, all of my windows pop straight up and I can continue running stuff in the background. I've got Vim over here, and I can zoom into it, or I can move over to the top, where my Jupyter kernel is running, and so forth. So if that sounds interesting, Reshama has a tutorial on how you can use tmux, and she's got a whole bunch of other stuff in her GitHub too, which is really cool. Apil Tamang has written a very nice summary of our last lesson, which covers what the key things we did were and why we did them.
So if you're wondering how it all fits together, I think this is a really helpful summary of what those couple of hours look like condensed into a page or two. I also really like that Pavel has done a deep dive on the learning rate finder, which is a topic a lot of you have been interested in learning more about; particularly those of you who have done deep learning before have realized that this is a solution to a problem you've had for a long time and haven't seen solved before. It's something that hasn't really been blogged about before; this is the first time I've seen it written up. When I put a link to Pavel's post on Twitter, it was shared hundreds of times and viewed many thousands of times; it's been really, really popular. So that's some great content. Radek has posted lots of cool stuff. I really like this Practitioner's Guide to PyTorch; this one is for more advanced students, aimed at people who've never used PyTorch before but know a bit about numerical programming in general. It's a quick introduction to how PyTorch is different. And then there have been some interesting little bits of research, like: what's the relationship between learning rate and batch size? One of the students actually asked me this before class, and I said, ah, well, one of the other students has written an analysis of exactly that. What he's done is basically try different batch sizes and different learning rates and see how they seem to relate to each other. These are all cool experiments you can try yourself. And Radek again has written up another piece of research into this question.
I made a claim that stochastic gradient descent with restarts finds more generalizable parts of the loss surface, because they're flatter, and he's been trying to figure out whether there's a way to measure that more directly. Not quite successful yet, but a really interesting piece of research. We've got some introductions to convolutional neural networks. And then, something we'll be learning about towards the end of this course, but which I'm sure you've noticed we're using, is something called a ResNet, and Anand Saha has posted a pretty impressive analysis of what a ResNet is and why it's interesting. This one has already been shared widely around the internet, and more advanced students who are interested in jumping ahead can look at it; Apil Tamang has also done something similar. So lots of stuff going on in the forums. I'm sure you've also noticed we now have a beginner forum specifically for asking questions. It's always the case that there are no dumb questions, but when there are lots of people around you talking about advanced topics, it might not feel that way, so hopefully the beginners forum is a less intimidating space; and if you're a more advanced student who can help answer those questions, please do. But remember, when you answer, try to answer in a way that's friendly to people who have maybe no more than a year of programming experience and haven't done any machine learning before. I hope everyone in the class feels like they can contribute as well, and remember: many of the people we just looked at, I believe, had never posted anything to the internet before. You don't have to be a particular kind of person to be allowed to blog.
You can just jot down your notes and throw them up there, and one handy thing is that if you put them on the forum and you're not quite sure of some of the details, you have an opportunity to get feedback: ah, well, that's not quite how that works, it actually works this way instead; or, that's a really interesting insight, have you thought about taking it further? So, what we've done so far is an introduction, purely as practitioners, to convolutional neural networks for images. We haven't talked much at all about the theory, or why they work, or the math; but what we have done is seen how to build a model that works exceptionally well, in fact at a world-class level, and we'll review a little bit of that today. Then today we're also going to dig in quite a lot more into the underlying theory: what is a CNN? What's a convolution? How does this work? Then we're going to go through this cycle where we do a short intro into a whole bunch of application areas: using neural nets for structured data, so things like logistics, forecasting, or financial data; then language and NLP applications using recurrent neural nets; and then collaborative filtering for recommendation systems. These will all be similar to what we've done for CNNs on images: here's how you can get a state-of-the-art result without digging into the theory, but knowing how to actually make it work. And then we're going to go back through those, almost in reverse order. We'll dig right into collaborative filtering in a lot of detail and see how to write the code underneath and how the math works underneath, and then we're going to do the same thing for structured data analysis.
We'll do the same thing for convnets for images, and finally an in-depth dive into recurrent neural networks. So that's where we're heading. Let's start with a bit of a review, and I also want to provide more detail on some steps we only briefly skipped over. I want to make sure we're all able to complete last week's assignment, which was dog breeds: basically, apply what you've learned to another dataset, and I thought the easiest one would be the dog breeds Kaggle competition. So I want to make sure everybody has everything you need to do this right now. The first thing is to make sure you know how to download data. There are two main places we're downloading data from at the moment: one is Kaggle, and the other is anywhere else. I'll do the Kaggle version first. To download from Kaggle we use something called kaggle-cli, which is here. To install it, I think it's already in our environment; let's double-check. Yes, it should already be there. But one thing that happens is that because this downloads from the Kaggle website by screen scraping, every time Kaggle changes their website, it breaks. So any time you try to use it, if Kaggle's website has changed recently, you'll need to make sure you've got the most recent version: you can always run pip install kaggle-cli --upgrade, and that'll make sure you've got the latest version of it and everything it depends on. Having done that, you can follow the instructions; actually, I think Reshama was kind enough to write these up, so everything you need to know about kaggle-cli can be found in Reshama's GitHub. Basically, the next step is to run kg download, and you provide your username with -u, your password with -p, and the competition name with -c.
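The steps just described can be sketched as a short shell recipe (the username, password, and competition name below are placeholders; substitute your own credentials and the name that appears after /c/ in the competition's URL):

```shell
# Upgrade first: kaggle-cli screen-scrapes Kaggle's site, so older
# versions break whenever the site changes.
pip install kaggle-cli --upgrade

# Download a competition's data. All three values are placeholders here:
kg download -u myusername -p mypassword -c some-competition-name
```

Remember this only works with a plain username/password login, and only after you've accepted the competition rules on the website.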
A lot of people on the forum have been confused about what to enter here. The key thing to note is that when you're at a Kaggle competition page, after the /c/ there's a specific name: planet-understanding-..., etc. That's the name you need. The other thing you'll need to make sure of is that, on your own computer, you've attempted to click download at least once, because when you do, it will ask you to accept the competition rules. If you've forgotten to do that, kg download will give you a hint: it'll say it looks like you might not have accepted the rules. And if you log into Kaggle with a Google account, or anything other than a username and password, this won't work, so you'll need to click "forgot password" on Kaggle and get them to send you a normal password. So that's the Kaggle version, and when you run it, you end up with a whole folder created for you with all of that competition's data in it. There are a couple of reasons you might not want to use that. One is that you're using a dataset that's not on Kaggle. The second is that you don't want all of the data in a Kaggle competition. For example, the Planet competition, which we've looked at a little bit and will look at again today, has data in two formats, TIFF and JPEG; the TIFF is 19 gigabytes and the JPEG is 600 megabytes, so you probably don't want to download both. So I'll show you a really cool trick, which somebody on the forum (I think one of the MSAN students here at USF) pointed out: a Chrome extension called CurlWget. You can just search for "curl wget", and you install it just by clicking install, like any other Chrome extension. From then on, every time you try to download something, say I try to download this file and then cancel it, you'll see this little yellow button that's been added up here, with a whole command in it. I can copy that, paste it into my terminal, hit enter, and there it goes.
What that does is capture all of the cookies, headers, and everything else needed to download that file. So this isn't just useful for downloading data; it's also useful any time you're trying to download something from behind a login, say a TV show or whatever, and that's actually very useful for data science, because quite often we want to analyze things like videos on our servers. So that's a good trick, and those are the two ways to get the data. Having got the data, you then need to build your model. You'll notice that I tend to assume the data is in a directory called data that's a subdirectory of wherever your notebook is. Now, you don't necessarily want to put your data there; you might want to put it directly in your home directory, or on another drive, or whatever. So what I do is, if you look inside my courses/dl1 folder, you'll see that data is actually a symbolic link to a different drive. You can put it anywhere you like and then just add a symbolic link, or you can put it there directly; it's up to you. If you haven't used symlinks before, they're like aliases or shortcuts on the Mac or Windows, very handy, and there are threads on the forum about how to use them if you want help with that. That, for example, is also how we have the fastai modules available from the same place as our notebooks: it's just a symlink to where they come from. Any time you want to see where things actually point in Linux, you can use the -l flag when listing a directory (ls -l), and that'll show you where the symlinks go; it'll also show you which things are directories, and so forth. So, one thing that may be a little unclear based on what we've done so far is how little code you actually need to do this end to end.
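The symlink idea described above can be sketched with Python's standard library; at the shell it's just `ln -s target linkname`. All the paths below are made up for illustration:

```python
import os
import tempfile

# A scratch area standing in for your home directory and a big data drive.
base = tempfile.mkdtemp()
data_drive = os.path.join(base, "bigdrive", "planet")   # hypothetical real location
os.makedirs(data_drive)

# 'data' inside the notebook folder is just a pointer to the real location,
# exactly like running `ln -s /bigdrive/planet data` at the shell.
notebook_dir = os.path.join(base, "courses", "dl1")
os.makedirs(notebook_dir)
link = os.path.join(notebook_dir, "data")
os.symlink(data_drive, link)

# `ls -l` would show where the link points; os.readlink does the same in Python.
print(os.path.islink(link), os.readlink(link))
```

Anything written through the link lands on the real drive, so your notebook code can always refer to plain old `data/`.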
What I've got here, in a single window, is an entire end-to-end process to get a state-of-the-art result for cats versus dogs. The only steps I've skipped are where we downloaded it from Kaggle and unzipped it. These are literally all the steps. We import our libraries; in fact, if you import this one, conv_learner, that basically imports everything else. We tell it the path where things are, the size that we want, and the batch size that we want. We'll learn a lot more shortly about what these do, but basically we say how we want to transform our data: we want to transform it in a way that's suitable to this particular kind of model, which assumes the photos are side-on photos, and we're going to zoom in up to 10% each time. We say that we want to get some data based on paths, and remember, this is the idea that there's a folder called cats and a folder called dogs, inside folders called train and valid. Note that you can always override these: if your things are in different folders you can either rename them, or, as you can see here, there's a train name and a val name, so you can always pick something else. Also notice there's a test name: if you want to submit something to Kaggle, you'll need to fill in the name of the folder where the test set is, and obviously those images won't be labeled. Then we create a model from a pretrained ResNet 50 model using this data, and we call fit; remember, by default that has all of the layers except the last few frozen, and again we'll learn a lot more about what that means. That took two and a half minutes. Notice that here I didn't say precompute=True; again, there's been some confusion on the forums about what that means.
It's only something that makes this first step a little faster, so you can always skip it; if you're at all confused about it, or it's causing you any problems, just leave it off, because it's just a shortcut that caches some of the intermediate steps so they don't have to be recalculated each time. And remember that when we're using precomputed activations, data augmentation doesn't work: even if you ask for augmentation, with precompute=True it doesn't actually do any, because it's using the cached, non-augmented activations. In this case, to keep things as simple as possible, I have no precomputing going on, so I do three cycles of length one, and then I unfreeze. Something we haven't seen before, and which we'll learn about in the second half, is called bn_freeze. For now, all you need to know is that if you're using a bigger, deeper model like ResNet 50 or ResNeXt 101, on a dataset that's very, very similar to ImageNet, like these cats and dogs, in other words side-on photos of standard objects of a similar size to ImageNet, somewhere between 200 and 500 pixels, you should probably add this line when you unfreeze. For those of you who are more advanced: what it's doing is causing the batch normalization moving averages to not be updated; in the second half of this course you'll learn all about why we do that. It's something that's not supported by any other library, but it turns out to be super important. Anyway, we then do one more epoch training the whole network, and at the end we use test time augmentation to ensure we get the best predictions we can, and that gives us 99.45%. So that's it: when you try a new dataset, these are basically the minimum set of steps you'd need to follow. You'll notice this assumes you already know what learning rate to use, so you'd use the learning rate finder for that, and it assumes I know the directory layout and so forth.
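Conceptually, test time augmentation just averages the model's predictions over the original image plus a few augmented versions of it. A minimal numpy sketch of that averaging step (all numbers and shapes here are made up):

```python
import numpy as np

# Pretend predictions from the original image plus 4 augmented versions,
# for 2 images and 3 classes: shape (n_augs, n_images, n_classes).
preds = np.random.rand(5, 2, 3)
preds /= preds.sum(axis=2, keepdims=True)   # make each row a probability vector

# TTA simply averages over the augmented copies:
tta_preds = preds.mean(axis=0)
print(tta_preds.shape)   # one probability vector per image
```

Because each augmented prediction is a valid probability vector, the average is too, and the averaging tends to smooth out mistakes the model makes on any single crop.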
So that's kind of a minimal set of steps. Now, one of the things I wanted to make sure you understand is how to use libraries other than fastai, and I feel the best thing to look at is Keras, because Keras is a library that, just as fastai sits on top of PyTorch, sits on top of a whole variety of different backends; mainly people nowadays use it with TensorFlow, but there's also an MXNet version and a Microsoft CNTK version. If you do a git pull, you'll see there's something called keras lesson 1, where I've attempted to replicate at least parts of lesson 1 in Keras, just to give you a sense of how that works. I'm not going to talk more about bn_freeze now, other than to say: if you're using something with a number larger than 34 at the end, like ResNet 50 or ResNeXt 101, and you're training on a dataset that's very similar to ImageNet, normal photos of normal sizes where the thing of interest takes up most of the frame, then you should probably add bn_freeze after you unfreeze. If in doubt, try training with it and then without. The more advanced students will certainly talk about it on the forums this week, and we'll talk about the details in the second half of the course, when we come back to our CNN section in the second-last lesson. So with Keras, again, we import a bunch of stuff. Remember I mentioned that this idea, that you've got a thing called train and a thing called valid, and inside those you've got a thing called dogs and a thing called cats, is a standard way of providing labeled images; Keras does that too, so we just tell it where the training set and the validation set are and what batch size to use. Now, you'll notice that with Keras we need much, much more code to do the same thing, and more importantly, each part of that code has many, many more things you have to set, and if you set them wrong,
everything breaks. So I'll give you a summary of what they are. Rather than creating a single data object, in Keras you first have to define something called a data generator, to say how to generate the data. For a data generator, we basically have to say what kind of data augmentation we want to do, and also what kind of normalization: whereas with fastai we just say "whatever ResNet 50 requires, do that for me, please", with Keras we have to know a bit about what's expected of us. Generally speaking, copying and pasting Keras code from the internet is a good way to make sure you've got the right stuff to make it work. And again, there isn't a standard set of "here are the best data augmentation parameters to use for photos", so I've copied and pasted all of this from the Keras documentation; I don't think it's the best set to use at all, but it's the set they use in their docs. So, having said this is how I want to generate data (horizontally flip sometimes, zoom sometimes, shear sometimes), we then create a generator from that, by taking the data generator and saying we want to generate images by looking in a directory, and we pass in the directory, which has the same directory structure that fastai uses; you'll see there's some overlap with how fastai works here. You tell it what size images to create and what batch size you want in your mini-batches, and then there's something here not to worry about too much, but basically, if you've just got two possible outcomes you'd generally say binary here, and something else if you've got multiple possible outcomes; we've only got cats or dogs, so it's binary. An example of where things get a little more complex: you have to do the same thing for the validation set, so it's up to you to create a data generator that doesn't have data augmentation, because
obviously for the validation set, unless you're using TTA, augmentation is going to stuff things up. Also, when you train, you randomly shuffle the images so they're always shown in different orders, to make things more random; but for the validation set it's vital that you don't do that, because if you shuffle the validation set you can't track how well you're doing, since it's in a different order to the labels. So those are the kinds of steps you have to do every time with Keras. Again, the reason I was using ResNet 50 is that Keras doesn't have ResNet 34, unfortunately, so we can't compare like with like. There isn't the same idea in Keras of saying "construct a model that's suitable for this dataset for me", so you have to do it by hand: you basically say, this is my base model, and then you construct on top of it, manually, the layers you want to add; by the end of this course you'll understand why these particular three layers are the layers we add. Having done that, in Keras you say, okay, this is my model. Then again, there isn't a concept of automatically freezing things, or an API for that, so you just have to loop through the layers you want to freeze and set trainable = False on them. In Keras there's also a concept we don't have in fastai or PyTorch of compiling a model: once your model is ready to use, you have to compile it, passing in what kind of optimizer to use, what loss to use, and what metrics. With fastai you don't have to pass this in, because we know what the right loss is for a particular model; you can always override it, but we give you good defaults. So, having done all that, rather than calling fit, you call fit_generator with the two generators you saw earlier, the train generator and the validation generator. For reasons I don't quite understand, Keras also expects you to tell it how
many batches there are per epoch: the number of batches is equal to the size of the generator divided by the batch size. You tell it how many epochs, and, just like in fastai, you can say how many processes, or workers, to use; unlike fastai, the default in Keras is basically not to use any, so to get good speed you'll want to make sure you include this. And that's basically enough to start fine-tuning the last layers. As you can see, I got to a validation accuracy of 95%, but as you can also see, something really weird happened where after one epoch it was 49%, then 69%, then 95%. I don't know why those early numbers are so low; that's not normal. There may be a bug in Keras, there may be a bug in my code; I reached out on Twitter to see if anybody could figure it out, but they couldn't. I guess this is one of the challenges of using something like this, and one of the reasons I wanted to use fastai for this course: it's much harder to screw things up, whereas here either I screwed something up or somebody else did. Yes, this is using the TensorFlow backend. And if you want to run this to try it out yourself, you can just go pip install tensorflow-gpu keras, because it's not part of the fastai environment by default, but that should be all you need to get it working. Then, there isn't a concept of layer groups, or differential learning rates, or partial unfreezing, or whatever, so you have to decide manually: I had to print out all of the layers and decide how many I wanted to fine-tune. I decided to fine-tune everything from layer 140 onwards, which is why I looped through like this. After you change that, you have to recompile the model, and then I ran another step, and again I don't know what happened here: the accuracy on the training set stayed about the same, but the validation accuracy totally fell in a hole.
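The steps-per-epoch arithmetic mentioned above is just a ceiling division: divide the number of images the generator will yield by the batch size and round up, so the last partial batch still gets seen. A quick sketch (the numbers are made up):

```python
import math

n_train = 23000    # images in the training generator (hypothetical)
batch_size = 64

# Keras's fit_generator wants the number of mini-batches per epoch,
# not the number of images, so round up the division.
steps_per_epoch = math.ceil(n_train / batch_size)
print(steps_per_epoch)   # → 360
```

If you forget the round-up and use integer division, an epoch can silently skip the final partial batch.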
The main things to note are, first, that there's a hell of a lot more code here, which is kind of annoying, but also that the performance is very different: even on the training set we're at 97% after 4 epochs, and that took a total of about 8 minutes, whereas over here we had 99.5% on the validation set, and it ran a lot faster, more like 4 or 5 minutes. So, depending on what you do, particularly if you end up wanting to deploy stuff to mobile devices, where at the moment the PyTorch-on-mobile situation is very early, you may find yourself wanting to use TensorFlow; or you may work for a company that's settled on TensorFlow. If you need to redo something you've learned here in TensorFlow, you probably want to do it with Keras, but just recognize it's going to take a bit more work to get there, and by default it's much harder to get the same state-of-the-art results you get with fastai: you'd have to replicate all of the state-of-the-art algorithms that are in fastai, so it's hard to get the same level of results. But you can see the basic ideas are similar, and it's certainly possible; there's nothing I'm doing in fastai that would be impossible. You'd have to implement stochastic gradient descent with restarts, differential learning rates, batch norm freezing, and so on, which you probably don't want to do yourself. Well, that's not quite true: I think at least one person on the forum is attempting to create a Keras-compatible, or TensorFlow-compatible, version of fastai, which I hope will get there. I actually spoke to Google about this a few weeks ago, and they're very interested in getting fastai ported to TensorFlow, so maybe by the time you're looking at this on the MOOC, that will exist; I certainly hope so. We'll see. Anyway, Keras and TensorFlow are certainly not that difficult to handle, so I don't think you should worry if you're told you have to learn them
after this course for some reason; it'll only take you a couple of days, I'm sure. So that's most of the stuff you'd need to complete last week's assignment, which was to try to do everything you've seen already, but on the dog breeds dataset. Just to remind you, in the last few minutes of last week's lesson I showed you how to do much of that, including how I actually explored the data to find out what the classes were, how big the images were, and so on; if you've forgotten that, or didn't quite follow it all, check out last week's video. One thing we didn't talk about is how you actually submit to Kaggle, and how you actually get predictions, so I want to show you that last piece as well; on the wiki thread this week I've already put a little image showing these steps. If you go to the Kaggle website, every competition has a section called Evaluation that tells you what to submit, and I just copied and pasted these two lines from there. It says we're expected to submit a file where the first line contains the word "id" followed by a comma-separated list of all the possible dog breeds, and every line after that contains an ID followed by the probabilities of all the different dog breeds. So how do you create that? Well, I know that inside our data object there's data.classes, which has all of the classes in alphabetical order, and inside data.test_ds you can also see all the file names. Just to remind you, dog breeds was not provided in the Keras-style format where the classes are in different folders; instead, it was provided as a CSV file of labels. When you get a CSV file of labels, you use ImageClassifierData.from_csv rather than ImageClassifierData.from_paths; there isn't an equivalent in Keras,
so on the Kaggle forums you'll see people share scripts for converting it to Keras-style folders. In our case we don't have to: we just use ImageClassifierData.from_csv, passing in that CSV file, and the CSV file automatically tells the data object what the classes are; then from the folder of test images we can see what the file names are, and with those two pieces of information we're ready to go. I always think it's a good idea to use TTA; as you saw with the dogs and cats example just now, it can really improve things, particularly when your model is less good. So I can say learn.TTA, and if you pass in is_test=True, it's going to give you predictions on the test set rather than the validation set. Obviously we can't get an accuracy now, because by definition we don't know the labels for the test set. By default, most PyTorch models give you back the log of the predictions, so we just have to take the exp of that to get back our probabilities. In this case the test set had 10,357 images and there are 120 possible breeds, so we get back a matrix of that size, and we now need to turn it into something that looks like the sample submission. The easiest way to do that is with pandas; if you're not familiar with pandas, there's lots of information online, or check out our Intro to Machine Learning course, where we do lots of stuff with pandas. Basically, we can just go pd.DataFrame and pass in that matrix, then say the names of the columns are equal to data.classes, and finally insert a new column at position 0, called id, that contains the file names. You'll notice the file names contain 5 characters at the start and 4 characters at the end that we don't want, so I just subset them out, like so. At that point I've got a DataFrame that looks like this, which is what we want.
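The steps just described can be sketched with numpy and pandas. The class names, file name, and probabilities below are all stand-ins for what fastai would give us (data.classes, data.test_ds file names, and the output of learn.TTA):

```python
import numpy as np
import pandas as pd

# Stand-ins for the real objects (all values here are made up):
classes = ["affenpinscher", "beagle", "collie"]               # data.classes
fnames = ["test/000621fb3cbb32d8935728e48679680e.jpg"]         # test-set file names
log_preds = np.log([[0.1, 0.7, 0.2]])                          # log-probability output

probs = np.exp(log_preds)                  # undo the log to get probabilities
df = pd.DataFrame(probs, columns=classes)  # one column per breed

# Strip the leading "test/" (5 chars) and trailing ".jpg" (4 chars),
# then put the ids in as the first column.
df.insert(0, "id", [f[5:-4] for f in fnames])

# Write it zipped up, since these files can get quite big.
df.to_csv("submission.csv.gz", index=False, compression="gzip")
```

The resulting file has "id" plus the breed names as its header row, and one row of probabilities per test image, which is the format Kaggle's evaluation page asks for.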
So you can now call df.to_csv. Quite often you'll find these files get quite big, so it's a good idea to say compression='gzip', and that'll zip it up on the server for you, creating a zipped-up CSV file wherever you're running this Jupyter notebook. You then need to get that back to your computer, so you can upload it, or you can use kaggle-cli: you can type kg submit and do it that way. I generally download it to my computer, because I often like to double-check it all looks okay. To do that, there's a cool little thing called FileLink: if you run FileLink with a path on your server, it gives you back a URL you can click on, and it'll download that file from the server onto your computer. So if I click on that now, I can go ahead and save it, and then in my downloads, there it is, my submission file. As you can see, it's exactly what was asked for: the first row contains "id" and the 120 different dog breeds, and each following row contains a file name and the 120 different probabilities. So then you can go ahead and submit that to Kaggle through their regular form. So we've now got a good way of both grabbing any file off the internet onto our AWS instance or Paperspace or whatever, using the cool little Chrome extension, and of grabbing stuff off our server easily. Those of you who are more command-line oriented can also use scp, of course, but I kind of like doing everything through the notebook. One other question I had during the week was: what if I want to get a prediction for just a single file? For example, maybe I want to get this first file from my validation set; there's its name. You can always look at a file just by calling Image.open; that just uses the regular Python Imaging Library.
So what you can do, and I'll show you the shortest version, is just call learn.predict_array, passing in your image. Now, the image needs to have been transformed. You've seen tfms_from_model before; normally we just put it all in one variable, but actually behind the scenes it was returning two things: training transforms and validation transforms. So I can split them apart, and here you can see I'm applying, for example, my training transforms, or more likely I want to apply my validation transforms. That gives me back an array containing the transformed image, which I can then pass to predict_array. Everything that gets passed to, or returned from, our models is generally assumed to be a minibatch: a bunch of images. We'll talk more about some NumPy tricks later, but basically in this case we only have one image, so we have to turn that into a minibatch of images. In other words, we need to create a tensor that is not just rows by columns by channels, but number-of-images by rows by columns by channels, with one image; so it becomes a four-dimensional tensor. There's a cool little trick in NumPy: if you index into an array with None, that adds an additional unit axis to the start, so it turns it from an image into a minibatch of one image; that's why we had to do that. So if you find you're trying to do things with a single image with any kind of PyTorch or fastai thing and it says something like 'expecting four dimensions, only got three', it probably means that. Or if you get back a return value from something that has some weird extra axis, that's probably why: it's probably giving you back a minibatch. We'll learn a lot more about this, but it's just something to be aware of. Okay, so that's kind of everything you need to do in practice.
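The `None` indexing trick looks like this in plain NumPy (the shapes are illustrative):

```python
import numpy as np

# A single image: rows x columns x channels, i.e. a 3-dimensional array.
img = np.zeros((224, 224, 3))

# Indexing with None adds a unit axis at the start, turning one image
# into a minibatch of one image: a 4-dimensional tensor.
batch = img[None]

print(img.shape)    # (224, 224, 3)
print(batch.shape)  # (1, 224, 224, 3)
```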
So now we're going to get into a little bit of theory: what's actually going on behind the scenes with these convolutional neural networks? You might remember back in lesson one we saw our first little bit of theory, which we stole from this fantastic website, setosa.io, Explained Visually, and we learned that a convolution is something where we have a little matrix (in deep learning, nearly always a three-by-three matrix), and we multiply every element of that matrix by every element of a three-by-three section of an image and add them all together, to get the result of that convolution at one point. Now let's see how that all gets put together to create the various layers that we saw in the Zeiler and Fergus paper. To do that, again I'm going to steal from somebody who's much smarter than I am: a guy called Otavio Good. Otavio Good was the guy who created Word Lens, which nowadays is part of Google Translate. If on Google Translate you've ever done that thing where you point your camera at something which has any kind of foreign language on it, and in real time it overlays it with the translation, Otavio's company built that. And Otavio was kind enough to share this fantastic video he created (he's at Google now), and I want to step you through it, because I think it explains really, really well what's going on. Then, after we look at the video, we're going to see how to implement the whole sequence, an entire set of layers of a convolutional neural network, in Microsoft Excel; so whether you're a visual learner or a spreadsheet learner, hopefully you'll be able to understand all this. So we're going to start with an image. Something that we're going to do later in the course is learn to recognize digits, and we'll do it end to end, the whole thing, and this is pretty similar: here we're going to try to recognize letters. So here's an A, which obviously is actually a grid of numbers, and so there's the
grid of numbers. So what we do is take our first convolutional filter (we're assuming these are already learned). You can see this one has got white down the right-hand side and black down the left, so it's like negative one, negative one, negative one; zero, zero, zero; one, one, one. And we're taking each three-by-three part of the image and multiplying it by that three-by-three matrix, not as a matrix product but as an element-wise product. So you can see what happens: everywhere the white edge is matching the edge of the A, we're getting green, a positive; and everywhere it's the opposite, we're getting a negative, a red. So that's the result of the first kernel. And here's a new kernel; this one has a white stripe along the top. So we literally scan it through every three-by-three part of the matrix, multiplying those nine bits of the A by the nine bits of the filter, to find out whether it's red or green, and how red or green it is. So this is assuming we had two filters: one was a bottom edge, one was a left edge. You can see here the top edge, not surprisingly... sorry, the bottom edge was red here and green here, and the left edge red here and green here. Then, in the next step, we add a non-linearity, the rectified linear unit, which literally means: throw away the negatives; all gone. So here's layer one, the input; here's layer two, the result of two convolutional filters; here's layer three, which is 'throw away all of the red stuff', and that's called a rectified linear unit. And then layer four: there's something called a max pool. In layer four we take every two-by-two part of this grid and replace it with its maximum, so it makes it half the size; it's basically the same thing, but half the size. And then we can go through and do exactly the same thing: we can have some new three-by-three filter that
we put through each of the two results of the previous layer. And again we can throw away the red bits, get rid of all the negatives, so we just keep the positives; that's called applying a rectified linear unit, and it gets us to our next layer of this convolutional neural network. So you can see that back at this earlier layer it was very interpretable: we've either got bottom edges or left edges. But the next layer is combining the results of convolutions, so it's starting to become a lot less clear, intuitively, what's happening; but it's doing the same thing. Then we do another max pool, so we replace every two-by-two (or three-by-three) section with a single digit; here this two-by-two is all black, so we replaced it with black. And then we take that, and we compare it to a kind of template of what we would expect to see if it was an A, if it was a B, if it was a C, a D, an E, and we see how closely it matches. We can do it in exactly the same way: we can multiply every one of the values in this four-by-eight matrix with every one of the four-by-eight in this one, and this one, and this one, and we just add them together to say how much it matches versus doesn't match; and then that can be converted to give us a percentage probability that this is an A. So in this case, this particular template matched well with A. Notice we're not doing any training here; this is how it would work if we have a pre-trained model. When we download a pre-trained ImageNet model off the internet and use it on an image without any changes to it, this is what's happening. Or if we take a model that you've trained and you're applying it to some test set or to some new image, this is what it's doing: it's taking the image through, applying multiple convolutional filters to each layer, and then doing the
rectified linear unit (so throw away the negatives), then doing the max pool, and then repeating that a bunch of times. And then we can do it with a new letter A or letter B or whatever, and keep going through that process. So as you can see, that's a far nicer visualization than I could have created, because I'm not Otavio; so thanks to him for sharing this with us, because it's totally awesome. Actually, this is not done by hand: he wrote a piece of computer software to do these convolutions, so this is actually being computed dynamically, which is pretty cool. So, I'm more of a spreadsheet guy personally; I'm a simple person. Here is the same thing, now in spreadsheet form, and you'll find this in the GitHub repo, so you can either git clone the repo to your own computer to open up the spreadsheet, or you can just go to github.com/fastai and click on it. If you go to our repo, go to courses as usual, then dl1 as usual, you'll see there's an excel section there, and here they all are; you can download them by clicking on them, or you can clone the whole repo. We're looking at conv-example, the convolution example. So you can see I have here an input; in this case the input is the number 7. I grabbed this from a dataset called MNIST, which we'll be looking at in a lot of detail, and I just took one of those digits at random and put it into Excel. You can see every pixel is actually just a number between 0 and 1; very often it'll actually be a byte between 0 and 255, or sometimes it might be a float between 0 and 1. It doesn't really matter: by the time it gets to PyTorch we're generally dealing with floats, so one of the steps we'll often take is to convert it to a number between 0 and 1. I've just used conditional formatting in Excel to make the higher numbers more red, so you can clearly see that this is a 7; but it's just a bunch of numbers that have been imported into Excel.
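The sequence we just watched (convolution, ReLU, max pool) can be sketched in a few lines of NumPy; the helper names, the random stand-in image, and the top-edge kernel here are mine, not from the lesson files:

```python
import numpy as np

def conv2d(img, kernel):
    # Multiply each 3x3 patch of the image element-wise by the kernel
    # and add the products up: one activation per patch.
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = (img[i:i+3, j:j+3] * kernel).sum()
    return out

def relu(x):
    return np.maximum(0, x)  # "throw away the negatives"

def max_pool(x):
    # Replace each NON-overlapping 2x2 block with its maximum,
    # halving the resolution in both directions.
    h, w = x.shape
    return x[:h//2*2, :w//2*2].reshape(h//2, 2, w//2, 2).max(axis=(1, 3))

# Top-edge kernel: 1s along the top, 0s in the middle, -1s at the bottom.
top_edge = np.array([[1., 1., 1.], [0., 0., 0.], [-1., -1., -1.]])

img = np.random.rand(28, 28)             # stand-in for an MNIST digit
layer = max_pool(relu(conv2d(img, top_edge)))
print(layer.shape)  # (13, 13)
```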
So here's our input. Remember, what Otavio did was then apply two filters with different shapes. Here I've created a filter which is designed to detect top edges: it's a 3x3 filter, and I've got 1s along the top, 0s in the middle, minus 1s at the bottom. So let's take a look at an example; it's here, and if I hit F2 you can see highlighted the 3x3 part of the input that this particular cell is calculating from. Here you can see 1, 1, 1 all being multiplied by 1, and 0.1, 0, 0 all being multiplied by negative 1; in other words, the positive bits are getting a lot of positive and the negative bits are getting nearly nothing at all, so we end up with a high number. Whereas on the other side of this bit of the 7, you can see it's basically 0s here; or, perhaps more interestingly, at the top of it here we've got high numbers at the top, but we've also got high numbers at the bottom which are negating them. So you can see that the only place we end up activating is where we're actually at an edge. So this number here is called an activation. When I say an activation, I mean a number, a number that is calculated; and it is calculated by taking some numbers from the input and applying some kind of linear operation, in this case a convolutional kernel, to calculate an output. You'll notice that all we're doing is going input multiplied by kernel, added together: here's my sum, and here's my multiply. I then take that and go max(0, that), and that's my rectified linear unit. It sounds very fancy, 'rectified linear unit', but what it actually means is: open up Excel and type =MAX(0, thing). That's all. People in the biz will just say 'ReLU'; ReLU means max(0, thing). And I'm not simplifying it, I really mean it: if I'm simplifying, I always say I'm simplifying, but if I'm not saying I'm simplifying, that's the entirety. Okay, so a rectified linear
unit in its entirety is this, and the convolution in its entirety is this. So a single layer of a convolutional neural network is being implemented, in its entirety, here in Excel. And you can see what it's done: it's deleted pretty much the vertical edges and highlighted the horizontal edges. Again, this is assuming that our network is trained, and that at the end of training it created a convolutional filter with these specific nine numbers in it. And here is a second convolutional filter; it's just a different nine numbers. Now, PyTorch doesn't store them as two separate nine-number arrays; it stores them as a tensor. Remember, a tensor just means an array with more dimensions; you can use the word array as well, it means the same thing, but in PyTorch you always use the word tensor, so I'm going to say tensor. So it's just a tensor with an additional axis, which allows us to stack each of these filters together. ('Filter' and 'kernel' pretty much mean the same thing: one of these three-by-three matrices, or one of these three-by-three slices of a three-dimensional tensor.) So if I take this one (and here I've literally just copied the formulas in Excel from above), you can see this one is now finding a vertical edge, as we would expect. So we've now created one layer. This here is a layer, and specifically we'd say it's a hidden layer: it's not an input layer and it's not an output layer, and everything else is a hidden layer. This particular hidden layer is of size two on this dimension, because it has two filters, two kernels. So what happens next? Well, let's do another one. As we go along, things multiply a little in complexity, because my next filter is going to have to contain two of these three-by-threes: I'm going to have to say how I want to weight these three things, and at the same time how I want to weight the corresponding three things down here, because in PyTorch this whole thing is going to
be stored as a multi-dimensional tensor. So you shouldn't really think of this as two three-by-three kernels, but as one two-by-three-by-three kernel. To calculate this value here, I've got the sum-product of all of that, plus the sum-product of (scroll down) all of that; the top ones are being multiplied by this part of the kernel, and the bottom ones by that part of the kernel. Over time, you want to get very comfortable with the idea of these higher-dimensional linear combinations. It's harder to draw on the screen (I had to put one above the other), but conceptually, just stack it in your mind; that's really how you want to think. Actually, Geoffrey Hinton, in his original 2012 Coursera class, has a tip about how all computer scientists deal with very high-dimensional spaces: they basically just visualize the two-dimensional space and then say 'twelve dimensions' really fast in their head lots of times. So that's it: we can see two dimensions on the screen, and then you've just got to trust that you can have more dimensions; conceptually there's nothing different about them. And you can see in Excel, since Excel doesn't have the ability to handle three-dimensional tensors, I had to say: okay, take this two-dimensional dot product, add on this two-dimensional dot product. But if there were some kind of 3D Excel, I could have done that in a single formula. And then, again, apply max(0, ...), otherwise known as the rectified linear unit, otherwise known as ReLU. So here is my second layer. When people create different architectures, architecture means things like: how big is your kernel at layer one? How many filters are in your kernel at layer one? So here I've got a three-by-three, there's number one, and a three-by-three, there's number two. So this architecture I've created starts off with two three-by-three convolutional kernels, and then my second layer has another two kernels of size two by three by three.
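That 'two sum-products added together' is the same thing as one dot product over a single 2x3x3 tensor; here's a sketch with random stand-in values (the shapes are illustrative, not the spreadsheet's):

```python
import numpy as np

rng = np.random.default_rng(0)
inp = rng.random((2, 8, 8))       # two channels from the previous layer
kernel = rng.random((2, 3, 3))    # one 2x3x3 kernel

# One second-layer activation at position (i, j): sum over both the
# channel axis and the 3x3 window in a single step...
i, j = 2, 3
act = (inp[:, i:i+3, j:j+3] * kernel).sum()

# ...which is exactly Excel's two 2-D SUMPRODUCTs added together.
by_hand = ((inp[0, i:i+3, j:j+3] * kernel[0]).sum()
           + (inp[1, i:i+3, j:j+3] * kernel[1]).sum())
print(np.isclose(act, by_hand))  # True
```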
So there's the first one, and down here, here's the second two-by-three-by-three kernel. Any one of these numbers is an activation: this activation is being calculated from these three things here and the other three things up there, using this two-by-three-by-three kernel. What tends to happen is people give names to their layers, so I'll say: okay, let's call this layer here conv1, and this layer here conv2. Generally you'll see, when you print out a summary of a network, that every layer has some kind of name. So what happens next? Well, part of the architecture is: do you have some max pooling, and whereabouts does that max pooling happen? In this architecture we're inventing, the next step is to do max pooling. Max pooling is a little hard to show in Excel, but we've got it: if I do a two-by-two max pooling, it's going to halve the resolution in both height and width. You can see here that I've replaced these four numbers with the maximum of those four numbers, and because I'm halving the resolution, it only makes sense to have something every two cells. So you can see here I've got the same-looking shape as I had back here, but it's now half the resolution, because I've replaced every two-by-two with its max. And you'll notice it's not every possible two-by-two: I skip over, so this one starts at column BQ and the next one starts at BS; they're non-overlapping, and that's why it's decreasing the resolution. So anybody who's comfortable with spreadsheets can open this up and have a look. After our max pooling, there are a number of different things we could do next, and I'm going to show you a kind of classic, old-style approach. Nowadays, what generally happens is we do a max pool where we kind of max
across the entire size. But in older architectures, and also in all the structured data stuff we do, we actually use something called a fully connected layer. So here's a fully connected layer: I'm going to take every single one of these activations and give every single one of them a weight; and then over here is the sum-product of every one of the activations by every one of the weights, for both of the two levels of my three-dimensional tensor. This is called a fully connected layer. Notice it's different to a convolution: I'm not going through a few at a time; I'm creating a really big weight matrix. Rather than having a couple of little three-by-three kernels, my weight matrix is now as big as the entire input. So, as you can imagine, architectures that make heavy use of fully connected layers can have a lot of weights, which means they can have trouble with overfitting, and they can also be slow. You're going to see a lot of an architecture called VGG, because it was the first successful deeper architecture; it has up to 19 layers, and VGG actually contains a fully connected layer connected to a hidden layer with 4,096 activations, connected to another hidden layer with 4,096 activations. So you've got 4,096 by 4,096, multiplied by, remember, the number of kernels that we've calculated. So in VGG there are something like 138 million weights, of which something like 120 million are in these fully connected layers. We'll learn later on in the course how we can avoid using these big fully connected layers; behind the scenes, all the stuff you've seen us using, like ResNet and ResNeXt, none of them use very large fully connected layers. Yannet, you had a question. 'Hi. So, can you tell us more about, for example, if we had three channels in the input, what would be the shape of these filters?' So that's a great question.
If we had two channels of input, it would look exactly like conv1: conv1 has two channels, and so our filters had to have two channels per filter. So you could imagine that this input didn't exist, and this was actually the input. When you have a multi-channel input (images are often full color, so they have three channels, red, green and blue; sometimes they also have an alpha channel), however many channels you have, that's how many your filters need. Something which I know Yannet was playing with recently was using a full-color ImageNet model in medical imaging, for something called bone age calculations, which has a single channel. What she did was take the single-channel input and make three copies of it, so you end up with three versions of the same thing. It's not ideal (it's kind of redundant information that we don't quite want), but it does mean that if you have something that expects a three-channel convolutional filter, you can use it. At the moment there's a Kaggle competition for iceberg detection, using some funky satellite-specific data format that has two channels; here's how you could handle that: you could either copy one of those two channels into the third channel, or, I think what people on Kaggle are doing, take the average of the two. Again, it's not ideal, but it's a way you can use pre-trained networks. I've done a lot of fiddling around like that. I've also done things where I wanted to use a three-channel ImageNet network on four-channel data: I had satellite data where the fourth channel was near-infrared, and so I added an extra level to my convolutional kernels that was all zeros, so it started off by ignoring the near-infrared band. And what happens, as you'll see next week, is this.
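Those three channel tricks can be sketched in NumPy (shapes are illustrative; the convention of out-channels x in-channels x height x width for conv weights is assumed):

```python
import numpy as np

# 1) Single-channel medical image -> three copies, so a pretrained
#    3-channel model can consume it.
gray = np.random.rand(1, 224, 224)
rgb_like = np.repeat(gray, 3, axis=0)

# 2) Two-channel satellite data -> fill the third channel with the
#    average of the two bands.
two_band = np.random.rand(2, 224, 224)
mean_band = two_band.mean(axis=0, keepdims=True)
three_band = np.concatenate([two_band, mean_band], axis=0)

# 3) Four-channel data with 3-channel pretrained kernels -> extend each
#    kernel with zeros so the extra band is ignored at the start.
w = np.random.rand(64, 3, 3, 3)                    # pretrained conv weights
w4 = np.concatenate([w, np.zeros((64, 1, 3, 3))], axis=1)

print(rgb_like.shape, three_band.shape, w4.shape)
```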
Rather than having these carefully trained filters, when you're actually training something from scratch we start with random numbers; that's actually what we do. We start with random numbers, and then we use this thing called stochastic gradient descent, which we've seen conceptually, to slightly improve those random numbers, to make them less random, and we do that again and again and again. Okay, great; let's take a seven-minute break and come back at 7:50. Alright, so what happens next? We've got as far as doing a fully connected layer: the results of our max pooling layer got fed to a fully connected layer. And you might notice, those of you that remember your linear algebra, that the fully connected layer is doing a classic, traditional matrix product: it's just going through each pair in turn, multiplying them together, and then adding them up. Now in practice, if we want to calculate which one of the 10 digits we're looking at, this single number we've calculated isn't enough; we would actually calculate 10 numbers. So rather than having just one set of fully connected weights like this (and I say 'set' because, remember, there's a whole 3D tensor of them), we would actually need 10 of those. So you can see that these tensors start to get a little bit high-dimensional, and this is where my patience for doing it in Excel ran out. But imagine I had done this 10 times: I could now have 10 different numbers, all being calculated here using exactly the same process; it would just be 10 of these fully connected arrays, and then we would have 10 numbers being spat out. So what happens next? Next, we can open up a different Excel workbook, entropy_example.xlsx; that's got two different worksheets, and one of them is called softmax. What happens here? Sorry, I've changed domains.
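Before moving on to softmax: the fully connected step above is just a matrix product, and it's easy to see why VGG-style fully connected layers blow up the weight count (shapes here are illustrative, not the spreadsheet's):

```python
import numpy as np

activations = np.random.rand(128)     # flattened max-pool output
weights = np.random.rand(10, 128)     # 10 sets of weights, one per digit

outputs = weights @ activations       # 10 numbers, one per class
print(outputs.shape)                  # (10,)

# Why fully connected layers get heavy: a single 4096 -> 4096 layer
# (as in VGG) already has almost 17 million weights.
print(f"{4096 * 4096:,}")             # 16,777,216
```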
Rather than predicting whether it's a number from 0 to 9, I'm going to predict whether something is a cat, a dog, a plane, a fish or a building. So out of that fully connected layer we've got, in this case, five numbers; and notice at this point there's no ReLU: in the last layer there's no ReLU, so I can have negatives. I want to turn each of these five numbers into a probability, a probability from 0 to 1 that it's a cat, that it's a dog, that it's a plane, that it's a fish, that it's a building. And I want those probabilities to have a couple of characteristics: first, each of them should be between 0 and 1; and second, together they should add up to 1, because it's definitely one of these five things. To do that, we use a different kind of activation function. What's an activation function? An activation function is a function that is applied to activations. So, for example, max(0, something) is a function that I applied to an activation. An activation function always takes in one number and spits out one number: max(0, x) takes in a number x and spits out max(0, x). That's all an activation function is. And if you remember back to that PowerPoint we saw in lesson 1, each of our layers was just a linear function, and after every layer we said we needed some non-linearity; because if you stack a bunch of linear layers together, all you end up with is a linear layer. (Somebody's talking; it's slightly distracting. Thank you.) If you stack a number of linear functions together, you just end up with a linear function, and nobody does any cool deep learning with just linear functions. But remember, we also learnt that by stacking linear functions with a non-linearity between each one, we could create arbitrarily complex shapes. And the non-linearity we're using after every hidden layer is a ReLU, a rectified linear unit. A non-linearity is an activation function; an activation function is a non-linearity
within deep learning. Obviously there are lots of other non-linearities in the world, but in deep learning this is what we mean: an activation function is any function that takes some activation in (that's a single number) and spits out some new activation, like max(0, x). So I'm now going to tell you about a different activation function. It's slightly more complicated than ReLU, but not too much; it's called softmax. Softmax only ever occurs at the very end, and the reason why is that softmax is an activation function that always spits out numbers between 0 and 1, and a bunch of numbers that add up to 1. So softmax gives us what we want. In theory this isn't strictly necessary: we could ask our neural net to learn a set of kernels which give probabilities that line up as closely as possible with what we want. But in general with deep learning, if you can construct your architecture so that the desired characteristics are as easy to express as possible, you'll end up with better models: they'll learn more quickly, with fewer parameters. So in this case, we know our probabilities should end up between 0 and 1, and should add up to 1; if we construct an activation function which always has those features, we're going to make our neural network do a better job. It makes it easier: the network doesn't have to learn to do those things, because they happen automatically. So to make this work, we first of all have to get rid of all the negatives; we can't have negative probabilities. One way to make things not be negative is to go e to the power of; so here you can see my first step is to take exp of the previous number. And I think I've mentioned this before, but of all the math you need to be super familiar with to do deep learning, what you really need is logarithms and exponents. In all of deep learning and all of machine learning, they appear all the time. So,
for example, you absolutely need to know that log(x times y) = log(x) + log(y). And not just know that it's a formula that exists, but have a sense of what it means and why it's interesting: I can turn multiplications into additions, and that could be really handy. And therefore log(x / y) = log(x) - log(y); again, that's going to come in pretty handy, because rather than dividing I can just subtract. Also remember that if log(x) = y, then e to the y equals x; so log and e-to-the-power-of are the inverses of each other. Again, you really need to understand these things; so if you haven't spent much time with logs and exps for a while, try plotting them in Excel or in a Jupyter notebook, get a sense of what shape they are and how they combine, and just make sure you're really comfortable with them. So we're using it here. One of the things we know is that e to the power of something is positive; so that's great. The other thing you'll notice about e to the power of something is that, because it's a power, numbers that are slightly bigger than other numbers (like 4 is a little bit bigger than 2.8) get really accentuated when you go e to the power of them. We're going to take advantage of both of these features for the purpose of deep learning. So we take the results of this fully connected layer, we go e to the power of each of them, and then we add them up; so here is the sum of the e-to-the-power-ofs. Then here we take e to the power of, divided by the sum of the e-to-the-power-ofs. And if you take all of these things divided by their sum, then by definition they must add up to 1; and furthermore, since we're dividing by their sum and they're always positive, each of them must be between 0 and 1. And that's it: that's what softmax is.
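The whole softmax calculation in a few lines of NumPy (I've added the standard max-subtraction trick to avoid overflow; it cancels out, so it doesn't change the answer):

```python
import numpy as np

def softmax(x):
    # exp makes everything positive and accentuates the biggest value;
    # dividing by the sum makes the outputs add up to exactly 1.
    e = np.exp(x - x.max())  # subtracting the max avoids overflow
    return e / e.sum()

logits = np.array([2.0, 4.0, -1.0, 0.5, 1.0])  # fully connected outputs
probs = softmax(logits)
print(probs.round(3))
```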
So I've got this recalculating random numbers each time, and you can see, as I loop through, my softmax generally has quite a few things that are so close to 0 that they round down to 0, and maybe one thing that's nearly 1. The reason for that is what we just talked about: with the exp, having one number a bit bigger than the others tends to push it out further. So even though my inputs here are random numbers between negative 5 and 5, my outputs from the softmax don't really look that random at all, in the sense that they tend to have one big number and a bunch of small numbers. And that's what we want: in terms of 'is this a cat, a dog, a plane, a fish or a building', we really want it to say 'it's a dog' or 'it's a plane', not 'I don't know'. So softmax has lots of these cool properties: it's going to return probabilities that add up to 1, and it's going to tend to want to pick one thing particularly strongly. So that's softmax. Yannet, could you pass it over? 'How would we do something where you have an image and you want to say it's the cat and the dog, where it has multiple things? What kind of function would we use?' So, as it happens, we're going to do that right now. How do we think about why we might want to do that? One reason is multi-label classification. So we're looking now at lesson 2, image models, and specifically we're going to take a look at the planet competition, the satellite imaging competition. Now, the satellite imaging competition has some similarities to stuff we've seen before: before, we've seen cat versus dog, and those images are a cat or a dog; they're not neither, and they're not both. But the satellite imaging competition has images that look like this, and in fact every single one of the images is classified by weather (there are four kinds of weather, one of which is haze and another of which is clear), in addition to which there is a list of features that may
be present, including agriculture (some cleared area used for agriculture), primary (which means primary rainforest), and water (a river or a creek). So here is a clear-day satellite image showing some agriculture, some primary rainforest and some water features; and here's one which is in haze and is entirely primary rainforest. So in this case we're going to want to predict multiple things, and softmax wouldn't be good, because softmax doesn't like predicting multiple things. I would definitely recommend anthropomorphizing your activation functions: they have personalities, and the personality of the softmax is that it wants to pick a thing. People forget this all the time; I've seen many people, even well-regarded researchers in famous academic papers, using softmax for multi-label classification. It happens all the time, and it's kind of ridiculous, because they're not understanding the personality of their activation function. So for multi-label classification, where each sample can belong to one or more classes, we have to change a few things. But here's the good news: in fastai you don't have to change anything. fastai will look at the labels in the CSV, and if there is ever more than one label for any sample, it will automatically switch into multi-label mode. I'm going to show you how it works behind the scenes, but the good news is you don't actually have to care; it happens anyway. If you have multi-label images, multi-label objects, you obviously can't use the classic Keras-style approach where things are in folders, because something can't conveniently be in multiple folders at the same time; that's why you basically have to use the from_csv approach. So let's look at an example; I'll take you through it. This is the CSV file containing our labels, and it looks exactly the same as it did before; but rather than transforms side-on, we use transforms top-down, and I've mentioned before
that it can do a vertical flip, but it actually does more than that. There are actually eight possible symmetries for a square: it can be rotated through 0, 90, 180 or 270 degrees, and for each of those it can be flipped, and if you think about it for a while you'll realize that's a complete enumeration of everything you can do in terms of symmetries to a square. It's called the dihedral group of eight, and if you look in the code there's actually a transform called dihedral; that's why it's called that. So this transform will do the full set of eight dihedral rotations and flips, plus everything we did for dogs and cats: small 10 degree rotations, a little bit of zooming, a little bit of contrast and brightness adjustment. These images are of size 256 by 256, so I just created a little function here to let me quickly grab a data loader of any size; here's a 256 by 256 one. Once you've got a data object, we've already seen that inside it there are things called val_ds, test_ds and train_ds. They're things you can just index into to grab a particular image, so you can just use square brackets zero. You'll also see that all of those things have a corresponding dl; that's a data loader. So ds is dataset, dl is data loader; these are concepts from PyTorch, so if you google PyTorch dataset or PyTorch data loader you can see what they mean. But the basic idea is that a dataset gives you back a single image or a single object, while a data loader gives you back a mini batch, and specifically a transformed mini batch. That's why when we create our data object we can pass in num_workers and transforms: how many processes do you want to use, and what transforms do you want. With a data loader you can't ask for an individual image; you can only get back a mini batch, and you can't get back a particular mini batch, only the next mini batch. So it's something we're just going to loop through, grabbing a mini batch at a time. And so in
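As an aside, the eight dihedral symmetries just described are easy to enumerate yourself. This is only an illustrative sketch in numpy, not fastai's actual dihedral transform:

```python
import numpy as np

def dihedral_transforms(img):
    """Return the eight dihedral symmetries of a square image:
    the four rotations, plus the four rotations of the mirrored image."""
    flips = [img, np.fliplr(img)]
    return [np.rot90(f, k) for f in flips for k in range(4)]

# An asymmetric 3x3 "image" makes all eight results distinct
img = np.arange(9).reshape(3, 3)
transforms = dihedral_transforms(img)
unique = {t.tobytes() for t in transforms}
print(len(transforms), len(unique))  # 8 8
```

Any image with no symmetry of its own yields eight distinct augmented versions, which is part of why this augmentation works so well for top-down satellite imagery.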
Python, the thing that does that is called a generator or an iterator (they're slightly different versions of the same idea). To turn a data loader into an iterator, you use the standard Python function called iter. That's just a regular part of the basic Python language; it returns an iterator, and an iterator is something you can pass to the standard Python function next, which just says: give me another batch from this iterator. This is one of the things I really like about PyTorch: it really leverages modern Python. In TensorFlow they invent their own whole new world of ways of doing things, and so in one sense it's more cross-platform, but in another sense it's not a good fit to any platform. So if you know Python well, PyTorch comes very naturally, and if you don't know Python well, PyTorch is a good reason to learn Python well. A PyTorch neural network module is a standard Python class, for example, so any work you put into learning Python better pays off when you work with PyTorch. So here I am using standard Python's iter and next to grab my next mini batch from the validation set's data loader, and that's going to return two things: the images in the mini batch and the labels in the mini batch. So, standard Python approach, I can pull them apart like so, and here is one mini batch of labels. I didn't pass in a batch size, so just remember shift-tab to see what things you can pass in and what the defaults are; by default my batch size is 64. So I've got back something of size 64 by 17, since there are 17 possible classes. So let's take a look at the labels for the zeroth image. I can use zip, again a standard Python thing: zip takes two lists and combines them, so you get the zeroth thing from the first list with the zeroth thing
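The iter/next pattern being described can be sketched in a few lines of pure Python. This toy class is only an illustration of the iterator protocol, not the real PyTorch DataLoader (which also shuffles, collates tensors and applies transforms in worker processes):

```python
class ToyDataLoader:
    """Minimal sketch of the iterator protocol a data loader follows.
    Illustrative only: a real DataLoader yields transformed tensor batches."""
    def __init__(self, dataset, batch_size):
        self.dataset, self.batch_size = dataset, batch_size

    def __iter__(self):
        # Yield one mini batch at a time; you can never ask for batch i directly
        for i in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[i:i + self.batch_size]

dl = ToyDataLoader(list(range(10)), batch_size=4)
it = iter(dl)      # same iter() you call on a real data loader
x = next(it)       # grab the next mini batch
print(x)           # [0, 1, 2, 3]
```

Note how there is no way to index into the loader itself, only to ask for the next batch, exactly as described above.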
from the second list, the first thing from the first list with the first thing from the second list, and so forth. So I can zip them together and find out that for the zeroth image in the validation set it's agriculture, it's clear, it's primary rainforest, it's slash-and-burn, it's water. So as you can see, this is multi-label; here's a way to do multi-label classification. By the same token, if we go back to our single-label classification (is it a cat, dog, plane, fish or building?), behind the scenes, and we haven't actually looked at it, fastai and PyTorch are turning our labels into something called one hot encoded labels. So if it was actually a dog, then the actual values would look like that; these are the actuals. Do you remember at the very end of Otavio's video, he showed how the template had to match one of the five A, B, C, D, E templates? What it's actually doing, when I said it's basically doing a dot product, is that there's a fully connected layer at the end that calculates an output activation, which goes through a softmax, and then the softmax is compared to the one hot encoded label. So if it was a dog there would be a one here, and then we take the differences between the actuals and the softmax activations and add them up, to say how much error there is. We're skipping over something called a loss function that we'll learn about next week, but essentially that's what we're doing. Now if it's one hot encoded, that is, if there's only ever one thing which has a one in it, then actually storing it as 0 1 0 0 0 is terribly inefficient. We could instead just record the index of each of these things: 0, 1, 2, 3, 4, like so. And so rather than storing it as 0 1 0 0 0, we actually just store the index value. So if you look at the y values for the cats and dogs competition or the dog breeds competition, you won't actually see a big list of 1s and 0s like
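The index-versus-one-hot trade-off just described looks like this in plain Python (a hypothetical helper, purely for illustration):

```python
def one_hot(idx, num_classes):
    """Expand a stored class index into the one hot encoded vector
    that the output activations are conceptually compared against."""
    v = [0.0] * num_classes
    v[idx] = 1.0
    return v

# Storing just the index is the compact form...
label = 1                      # e.g. "dog" in a 5 class problem
# ...and this is what it stands for behind the scenes
print(one_hot(label, 5))       # [0.0, 1.0, 0.0, 0.0, 0.0]
```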
this; you'll see a single integer, which is the class index. Internally, PyTorch will actually turn that into a one hot encoded vector, but you will literally never see it. And PyTorch has different loss functions, where you basically say this thing is one hot encoded or this thing is not, and it uses a different loss function accordingly. That's all hidden by the fastai library, so you don't have to worry about it, but the cool thing to realize is that with this approach to multi-label encoding with ones and zeros, behind the scenes the exact same thing happens for single-label classification. Does it make sense to change the pickiness of the sigmoid or the softmax function by changing the base? No, because remembering your math, log base A of B equals log B over log A, so changing the base is just a linear scaling, and linear scaling is something which the neural net can learn very easily. Good question. OK, so here is that image, the image with slash-and-burn, water, etc. One of the things to notice here is that when I first displayed this image it was so washed out I really couldn't see it. But remember, we know images are just matrices of numbers, and so you can see here I just said times 1.4, just to make it more visible. This is the kind of thing I want you to get familiar with: the idea that the stuff you're dealing with is just matrices of numbers, and you can fiddle around with them. So if you're looking at something and it's a bit washed out, you can just multiply it by something to brighten it up a bit. So here we can see, I guess this is the slash-and-burn, here's the river (that's the water), here's the primary rainforest, maybe that's the agriculture, and so forth. So with all that background, how do we actually use this? Exactly the same way as everything we've done before. So you know
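The "times 1.4" trick really is just array arithmetic. Here is a sketch with numpy, assuming pixel values scaled to the range [0, 1]; the clipping is my addition, to keep values valid after scaling:

```python
import numpy as np

def brighten(img, factor):
    """Images are just arrays of numbers, so a washed out image can be
    brightened by simple multiplication, clipped so values stay in [0, 1]."""
    return np.clip(img * factor, 0.0, 1.0)

img = np.array([[0.2, 0.5],
                [0.7, 0.9]])
print(brighten(img, 1.4))
```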
size, and the interesting thing about playing around with this planet competition is that these images are not at all like ImageNet's. And I would guess that the vast majority of the stuff most of you do involving convolutional neural nets won't actually be anything like ImageNet: it'll be medical imaging, or classifying different kinds of steel tube, or figuring out whether a weld is going to break or not, or looking at satellite images, or whatever. So it's good to experiment with stuff like this planet competition to get a sense of what you want to do. And so you'll see here I start out by resizing my data to 64 by 64; it starts out at 256 by 256. Now, I wouldn't want to do this for the cats and dogs competition, because there we start with a pre-trained ImageNet network that starts off nearly perfect. If we resized everything to 64 by 64 and then retrained the whole network, we would basically destroy the weights that are already pre-trained to be very good. Remember, most ImageNet models are trained at either 224 by 224 or 299 by 299, so if we retrain them at 64 by 64, we're going to kill them. On the other hand, there's nothing in ImageNet that looks anything like this; there are no satellite images. So the only useful bits of the ImageNet network for us are layers like this one, finding edges and gradients, and this one, finding textures and repeating patterns, and maybe these ones, finding more complex textures, but that's probably about it. So in other words, starting out by training on very small images works pretty well when you're using stuff like satellite data. So in this case I started right back at 64 by 64, grabbed some data, built my model, and found out what learning rate to use. Interestingly, it turned out to be quite high. It seems that because it's so unlike ImageNet, I needed to do quite a bit more fitting of just
that last layer before it started to flatten out. Then I unfroze it, and here's a difference from ImageNet-like datasets: the learning rate for the initial layers I set to the final learning rate divided by 9, and the middle layers divided by 3, whereas for stuff like ImageNet I had a multiple of 10 between each of those. Again, the idea is that the earlier layers are probably not as close to what they need to be, compared to ImageNet-like datasets. So again: unfreeze, train for a while, and you can see here there's cycle 1, there's cycle 2, there's cycle 3. Then I doubled the size of my images, fit for a while, unfroze, fit for a while, doubled the size of the images again, fit for a while, unfroze, fit for a while, and then added TTA. This process ends up getting us to about 30th place in this competition, which is really cool, because a lot of very, very smart people just a few months ago worked very, very hard on this competition. A couple of things people have asked about. One is: what does this data.resize do? There are a couple of different pieces here. The first is that when we say, back here, what transforms to apply, we actually pass in a size, and one of the things the data loader does is to resize the images on demand every time it sees them. That's got nothing to do with the .resize method. The transforms are the thing that happens at the end: whatever is passed in, before our data loader spits it out, it's going to resize it to this size. If the initial input is, say, a thousand by a thousand, reading that JPEG and resizing it to 64 by 64 turns out to actually take more time than training the convnet does for each batch. So basically all resize does is say: hey, I'm not going to be using any images bigger than size times 1.3, so just go through once and create new JPEGs of this size. They're rectangular, so new JPEGs where the smallest edge is of this size. And again, you don't have to do this; there's no reason to
ever use it if you don't want to; it's just a speed-up. But if you've got really big images coming in, it saves you a lot of time, and you'll often see on Kaggle kernels or forum posts that people have bash scripts or that sort of thing to loop through and resize images to save time. You never have to do that; you can just say .resize and, once off, it'll go through and create the resized versions, and if they're already there it'll use them for you. So it's just a speed-up convenience function, no more. So for those of you that are past dog breeds, I would be looking at planet next; play around with trying to get a sense of how you can make this an accurate model. One thing to mention, and I'm not really going to go into it in detail because it's nothing to do with deep learning particularly, is that I'm using a different metric: I didn't use metrics=accuracy, I said metrics=f2. Just remember from last week that confusion matrix, the two by two correct/incorrect for each of dogs and cats: there are a lot of different ways you could turn that confusion matrix into a score. Do you care more about false negatives or about false positives, and how do you weight them and combine them together? There's basically a function called F beta, where the beta says how much you weight false negatives versus false positives, and F2 is F beta with beta equal to two; it's a particular way of weighting false negatives and false positives. The reason we use it is that Kaggle told us that Planet, who are running this competition, wanted to use this particular F beta metric. The important thing for you to know is that you can create custom metrics. In this case you can see here it says from planet import f2, and I've really got this here so that you can see how to do it. So if you look inside courses/dl1, you can see
there's something called planet.py, and if I look at planet.py you'll see there's a function there called f2. f2 simply calls fbeta_score from scikit-learn, where the idea comes from, and does a couple of little tweaks that are particularly important. But the important thing is, you can write any metric you like, as long as it takes in a set of predictions and a set of targets, which are both going to be one dimensional numpy arrays, and returns back a number. So as long as you create a function that takes two vectors and returns a number, you can use it as a metric. And so then when we said learn metrics equals, and passed in that array which just contains the single function f2, it's just going to be printed out after every epoch for you. So in general, with the fastai library, everything is customizable: the idea is that everything gives you what you might want by default, but everything can be changed as well. Yes? We have a little bit of confusion about the difference between multi-label and just single-label; do you have, by any chance, an example in which you compute it, similarly to the example that you just showed us? Ah, I didn't get to that activation function; I'm so sorry, I said I'd do that and then I didn't. So the output activation function for single-label classification is softmax, for all the reasons that we talked about. But if we were trying to predict something that was like 0 0 1 1 0, then softmax would be a terrible choice, because it's very hard to come up with something where both of these are high. In fact it's impossible, because they have to add up to one, so the closest they could be would be 0.5. So for multi-label classification, our activation function is called sigmoid. And again, the fastai library does this automatically for you if it notices you have a multi-label problem, and it does that by checking your dataset to see if anything has more than
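For reference, the F beta formula itself is short. This is a sketch of the math for a single sample's binary label vector, not fastai's actual f2 (which wraps scikit-learn's fbeta_score with extra tweaks); the 0.5 threshold here is an assumption:

```python
def fbeta(preds, targs, beta=2, thresh=0.5):
    """F beta for one sample: beta > 1 weights false negatives (recall)
    more heavily than false positives (precision)."""
    p = [int(x > thresh) for x in preds]          # threshold the predictions
    tp = sum(pi and ti for pi, ti in zip(p, targs))
    precision = tp / max(sum(p), 1)
    recall = tp / max(sum(targs), 1)
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

print(fbeta([0.9, 0.8, 0.2], [1, 0, 1]))   # 0.5: one hit, one miss, one false alarm
```

Any function with this shape, two vectors in, one number out, can be passed in the metrics list.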
one label applied to it. So sigmoid is a function which is basically the same idea as softmax, except that rather than adding up all of the e to the x's, we just take this one e to the x and say the output is equal to it divided by 1 plus it. The nice thing about that is that multiple things can be high at once. Generally, if something is less than 0, its sigmoid is going to be less than 0.5, and if it's greater than 0, its sigmoid is going to be greater than 0.5. The important thing to know about a sigmoid function is its shape: it asymptotes at the top to 1 and asymptotes at the bottom to 0, and so it's a good thing to model a probability with. Anybody who has done any logistic regression will be familiar with this; it's what we do in logistic regression, so it appears everywhere in machine learning. And you'll see that a sigmoid and a softmax are very close to each other conceptually, but this is what we want as our activation function for multi-label, and this is what we want for single-label; and again, fastai does it all for you. There was a question over here? Yes: I have a question about the initial training that you do. If I understand correctly, we have frozen the pre-trained model and initially we only try to train the last layer, right? Right. But on the other hand, we said that the early layers are the ones that are important to us, and the later ones are more like ImageNet-specific features. Well, the layers are very important, but the pre-trained weights in them aren't. It's the later layers that we really want to train the most; the earlier layers are likely to be already closer to what we want. OK, so you start with the last one and then go backwards? Right. So if you go back to our quick dogs notebook: when we create a
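The contrast between the two activation personalities is easy to see numerically. Here's a pure-Python sketch (note that a production implementation would subtract the max before exponentiating, for numerical stability):

```python
import math

def softmax(xs):
    """Wants to pick one thing: outputs are positive and must sum to 1."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Judges each output independently: e^x / (1 + e^x), asymptoting
    to 0 below and to 1 above, so several labels can be high at once."""
    return math.exp(x) / (1 + math.exp(x))

acts = [3.0, 3.0, -4.0]
print([round(s, 3) for s in softmax(acts)])   # two strong activations fight over the total
print([round(sigmoid(a), 3) for a in acts])   # both strong ones can be near 1 independently
```

With two equally strong activations, softmax can give each at most about 0.5, which is exactly the multi-label failure described above; sigmoid happily pushes both toward 1.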
model from a pre-trained model, it returns something where all of the convolutional layers are frozen and some randomly initialized fully connected layers we add to the end are unfrozen. So when we call fit at first, it just trains the randomly initialized fully connected layers, and if something is really close to ImageNet, that's often all we need, because the other layers are already good at finding edges, gradients, repeating patterns, ears, dogs' heads and so on. So then, when we unfreeze, we set the learning rates for the early layers to be really low, because we don't want to change them much, whereas the later ones we set to be higher. For satellite data this is no longer quite true: the early layers are still better than the later layers, but we still probably need to change them quite a bit, and that's why this learning rate is nine times smaller than the final learning rate, rather than a thousand times smaller. So you play with the learning rates of the layers rather than freezing them? Right. Most of the stuff you see online, if it talks about this at all, talks about unfreezing different subsets of layers, and indeed we do unfreeze our randomly generated ones. But what I've found is that although in the fastai library you can type learn.freeze_to and freeze just a subset of layers, this approach of using differential learning rates seems to be more flexible, to the point that I never find myself unfreezing subsets of layers. But what I didn't understand is, I would expect you to start with the differential learning rates, rather than training just the last layer. So you could skip training just the last layers and go straight to differential learning rates, but you probably don't want to. The reason is that there's a difference: the convolutional layers all contain pre-trained weights, so they're not
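Concretely, the differential learning rates being discussed are just a small array, one rate per layer group. The 0.01 here is a made-up example value, not a recommendation; you'd find your own with the learning rate finder:

```python
import numpy as np

lr = 0.01  # hypothetical final-layer learning rate found with the LR finder

# ImageNet-like data: early layers are already nearly right,
# so train them with a multiple of 10 between each group
lrs_imagenet_like = np.array([lr / 100, lr / 10, lr])

# Satellite-like data: early layers still need real movement,
# so only 9x and 3x smaller than the final rate
lrs_satellite_like = np.array([lr / 9, lr / 3, lr])
print(lrs_satellite_like)
```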
random. For things that are close to ImageNet they're actually really good; for things that are not close to ImageNet they're better than nothing. All of our fully connected layers, however, are totally random, so you would always want to make the fully connected weights better than random by training them a bit first. Otherwise, if you go straight to unfreeze, you're actually going to be fiddling around with those early layer weights while the later ones are still random, which is probably not what you want. I think there's another question here. So when we unfreeze, what are the things we are trying to change? Will it change the kernels themselves? Yeah, that's always what SGD does. The only thing training means is setting these numbers, and these numbers, and these numbers: the weights. The weights are the weights of the fully connected layers and the weights in those convolutional kernels. That's what training means, and we'll learn how to do it with SGD, but training literally is setting those numbers. These numbers, on the other hand, are activations; they're calculated from the weights and the previous layer's activations or inputs. I have a question. (Can you hold that up higher and speak into it?) So in your example of training on the satellite images, you start with a very small size like 64, so does it literally mean that the model takes a small area from the entire image, 64 by 64? How we get that 64 by 64 depends on the transforms. By default our transform takes the smallest edge, zooms the whole thing out, resamples it so the smallest edge is of size 64, and then takes the center crop of that, although when we're using data augmentation it actually takes a randomly chosen crop. But in a case where an image has multiple objects, wouldn't you just lose the other things? Which is why data augmentation is important, and particularly test time augmentation is going to be
particularly important: there may be an artisanal mine out in the corner that you don't see in the center crop, so data augmentation becomes very important here. I actually have one other question: when we talk about metrics, whether it's accuracy or F2, that's not really what the model tries to optimize? That's a great point; it's not the loss function. The loss function is something we'll be learning about next week; it's cross entropy, otherwise known as negative log likelihood. The metric is just the thing that's printed, so we can see what's going on. So in the context of multi-class modeling, does our training data also have to be multi-class, or can I train on just images of pure cats and pure dogs, and expect it at prediction time, if I give it a picture containing both a cat and a dog, to predict both? I've never tried that, and I've never seen an example of something that needed it. I guess conceptually there's no reason it wouldn't work, but it's kind of out there. You would use a sigmoid activation, and you'd have to make sure you're using a sigmoid-based loss function. In this case fastai's default would not work, because by default fastai would look at your training data, see that nothing ever has both a cat and a dog, and treat it as single-label, so you would have to override the loss function. Thanks. When we use the differential learning rates, do those three learning rates just spread evenly across the layers? Yeah, we'll talk more about this later in the course, but in the fastai library there's a concept of layer groups. In something like a ResNet-50 there are hundreds of layers, and I figured you don't want to write down hundreds of learning rates, so I've basically decided for you how to split them. The last one always refers just to the fully connected layers that we've randomly initialized and added to the end, and the others are split generally about halfway through. Basically, I've tried to make it so that these are the ones which you hardly want to change at all, and
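The layer-group splitting being described could be sketched roughly like this. The "last two layers are the head" assumption is mine, purely for illustration; fastai's real grouping logic is more involved:

```python
def layer_groups(layers):
    """Rough sketch of the idea: split a long list of layers into three
    groups so only three learning rates are needed, with the final group
    being the newly added fully connected head."""
    body, head = layers[:-2], layers[-2:]   # assume the last 2 layers are the new head
    half = len(body) // 2
    return [body[:half], body[half:], head]

groups = layer_groups([f"layer{i}" for i in range(10)])
print([len(g) for g in groups])   # [4, 4, 2]
```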
these are the ones you might want to change a little bit. I don't think we'll cover it in the course, but if you're interested we can talk about it on the forum: there are ways you can override this behavior to define your own layer groups if you want to. And is there any way to visualize the model easily, or dump the layers of the model? Yeah, absolutely, you can. Let's make sure we've got one here. OK, so if you just type learn, it doesn't tell you much at all; what you can do is go learn.summary, and that spits out basically everything: all the layers. And you can see in this case the names I mentioned, how they all got names: the first layer is called Conv2d-1, and it's going to take as input (this is useful to actually look at) 64 by 64 images, which is what we told it we're going to transform things to, with three channels. Most libraries put the channels at the end, say 64 by 64 by 3; PyTorch moves them to the front, so it's 3 by 64 by 64. That's because it turns out some of the GPU computations run faster in that order, but that happens behind the scenes automatically as part of that transformation stuff. The minus 1 means however big the batch size is; in Keras they use the special value None, in PyTorch it's minus 1. So this is a four dimensional mini batch: the number of images in the mini batch is dynamic, you can change that, the number of channels is 3, and the image size is 64 by 64. And so then you can see that this particular convolutional layer apparently has 64 kernels in it, and it's also halving the size. We haven't talked about this, but convolutions can have something called a stride, which, like max pooling, changes the size, so it's returning a 64 by 32 by 32 tensor, and so on and so forth. That's summary, and we'll learn about what all of that is doing in detail in the second half of the course. One more: I collected my own dataset, and it's
a really small dataset, currency images collected from Google Images, and I tried to do a learning rate find and then the plot, and it gave me some numbers which I didn't understand, and the plot was empty. Yeah, let's talk about that on the forum, but basically the learning rate finder is going to go through a mini batch at a time, and if you've got a tiny dataset there just aren't enough mini batches. So the trick is to make your batch size really small: try making it four or eight or something. OK, they were great questions; nothing online to add to them. We've got a little bit past where I hoped to be, but let's quickly talk about structured data so we can start thinking about it for next week. This is really weird, right: to me there are basically two types of dataset we use in machine learning. There's a type of data, like audio, images and natural language text, where all of the things inside an object, like all of the pixels inside an image, are the same kind of thing: they're all pixels, or they're all amplitudes of a waveform, or they're all words. I call this kind of data unstructured. And then there are datasets like a profit and loss statement, or the information about a Facebook user, where each column is structurally quite different: one is representing how many page views last month, another is their sex, another is what zip code they're in. I call this structured data. That particular terminology is not unusual; lots of people use it, but lots of people don't, and there's no particularly agreed-upon terminology. So when I say structured data, I'm referring to columnar data, as you might find in a database or a spreadsheet, where different columns represent different kinds of things and each row represents an observation. And structured data is probably what most of you are analyzing most of the time. Funnily enough, you know,
academics in the deep learning world don't really give a shit about structured data, because it's pretty hard to get published in fancy conference proceedings if you've just got a better logistic regression model. But it's the thing that makes the world go round, the thing that makes everybody money and efficiency and makes stuff work, and it's largely ignored, sadly. So we're not going to ignore it, because this is practical deep learning, and Kaggle doesn't ignore it either, because people put prize money up on Kaggle to solve real world problems. So there are some great Kaggle competitions we can look at. There's one running right now, which is the grocery sales forecasting competition for Ecuador's largest chain. I've got to be a little careful about how much I show you about currently running competitions, because I don't want to help you cheat, but it so happens there was a competition a year or two ago for one of Germany's largest grocery chains which is almost identical, so I'm going to show you how to do that. That was the Rossmann Store Sales data, and so I would suggest you first of all try practicing what we're learning on Rossmann, and then see if you can get it working on groceries, because currently on the leaderboard no one seems to basically know what they're doing in the groceries competition. If you look at the leaderboard, these entries around 0.529 to 0.530 are people that are literally finding group averages and submitting those; I know, because those are the kernels they're using. So basically the people around 20th place are not actually doing any machine learning. Yeah, let's see if we can improve things. So you'll see there's a lesson 3 Rossmann notebook in the repo; make sure you git pull. In fact, just a reminder: before you start working, git pull in your fastai repo, and from time to time conda env update. For you guys during the in-person course, you should do the conda env update more often, because we're changing things
a little bit; for folks in the MOOC, more like once a month should be fine. Anyway, I just changed this a little bit, so make sure you git pull to get the lesson 3 Rossmann notebook. There are a couple of new libraries here. One is fastai.structured, which contains stuff that is actually not at all PyTorch-specific; we use it in the machine learning course as well, for doing random forests with no PyTorch at all. I mention that because you can use that particular library without any of the other parts of fastai, which can be handy. And then we're also going to use fastai.column_data, which is basically some stuff that allows us to do fastai and PyTorch things with columnar structured data. For structured data we need to use pandas a lot. Anybody who's used R's data frames will be very familiar with pandas: pandas is basically an attempt to replicate R's data frames in Python, and a bit more. If you're not entirely familiar with pandas, there's a great book which I think I might have mentioned before, Python for Data Analysis by Wes McKinney; a new edition just came out a couple of weeks ago. Obviously, being by the pandas author, its coverage of pandas is excellent, but it also covers numpy, scipy, matplotlib, scikit-learn, IPython and Jupyter really well, and I'm going to assume that you know your way around these libraries to some extent. Also, there was the workshop we did before this course started, and there's a video of that online where we briefly cover all of those tools. Structured data is generally shared as CSV files, and it was no different in this competition. As you'll see, there's a hyperlink to the Rossmann dataset here, and if you look at the bottom of my screen you'll see it goes to files.fast.ai. Because this doesn't require any login or anything to grab the dataset, it's as simple as right-clicking, copy link address, heading over to where you want it, and just typing wget and the URL. That's because it's not behind a
login or anything, so you can grab it from there, and you can always read a CSV file with just pandas.read_csv. Now, in this particular case there's a lot of preprocessing that we do, and what I've actually done here is steal the entire pipeline from the third place winner of Rossmann. They made their whole pipeline available on their GitHub, with everything that we need, and I've ported it all across, simplified it and tried to make it pretty easy to understand. This course is about deep learning, not about data processing, so I'm not going to go through it, but we will be going through it in the machine learning course in some detail, because feature engineering is really important; if you're interested, check out the machine learning course for that. I will, however, show you roughly what it looks like. So once we read the CSVs in, you can see basically what's there. The key one is that for a particular store we have the date and the sales for that particular store. We know whether that store is on promo or not, we know the number of customers that particular store had, and we know whether that date was a school holiday. We also know what kind of store it is. This is pretty common: you'll often get datasets where there's some column with just some kind of code, and we don't really know what the code means; most of the time I find it doesn't matter what it means. Normally you get given a data dictionary when you start on a project, and if you're working on an internal project you can ask the people at your company what a column means. I kind of stay away from learning too much about it; I prefer to see what the data says first. There's something about what kind of product we're selling in this particular row, and then there's information about how far away the nearest competitor is, how long they've been open for, and how long the promo has been on for. For each store we can find out what state it's in; for
each state we can find out the name of the state. This is in Germany, and interestingly, they were allowed to download any external data they wanted in this competition (that's very common, as long as you share it with everybody else). So some folks tried downloading data from Google Trends. I'm not sure exactly what it was that they were checking the trend of, but we have this information from Google Trends, and somebody downloaded the weather for every day in Germany, for every state. And that's about it.

You can get a data frame summary with pandas, which lets you see how many observations there are, means, standard deviations, and so on. Again, I don't do a hell of a lot with that early on, but it's nice to know it's there.

So what we do: this is called a relational dataset. A relational dataset is one where there are quite a few tables we have to join together. It's very easy to do that in pandas; there's a great little function called `merge`. So I just started joining everything together: I joined in the weather, the Google Trends, the stores, and that's about everything, I guess. You'll see there's one thing I'm using from the fast.ai library, which is called `add_datepart`. We talk about this a lot in the machine learning course, but basically it's going to take a date and pull out of it a bunch of columns (day of week, is it the start of a quarter, month of year, and so on and so forth) and add them all to the dataset. So this is all standard preprocessing.

So we join everything together, we fiddle around with some of the dates a little bit (some of them are in month-and-year format, so we turn them into date format), and we spend a lot of time taking information about, for example, holidays, and adding columns like how long until the next holiday and how long it has been since the last holiday; we did the same for promos, and so on and so forth. So we do all that, and at the very end we basically save a big structured data file that contains all that stuff.
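The join-and-feature-engineer steps just described might look roughly like this. This is a minimal sketch with made-up table and column names, not the actual Rossmann schema, and `add_datepart_mini` is a tiny illustrative stand-in for fastai's `add_datepart`:

```python
import pandas as pd

# Hypothetical miniature of the pipeline: a sales table and a store table
# that we join together, then expand the date into several columns.
train = pd.DataFrame({
    "Store": [1, 1, 2],
    "Date": pd.to_datetime(["2015-07-01", "2015-07-02", "2015-07-01"]),
    "Sales": [5263, 6064, 8314],
})
store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
    "State": ["BE", "NW"],
})

# Relational data: join the tables with pandas merge
df = train.merge(store, on="Store", how="left")

def add_datepart_mini(df, col):
    # A small version of what fastai's add_datepart does: pull a date
    # column apart into several columns the model can use
    dt = df[col].dt
    df[f"{col}Year"] = dt.year
    df[f"{col}Month"] = dt.month
    df[f"{col}Dayofweek"] = dt.dayofweek
    df[f"{col}Is_quarter_start"] = dt.is_quarter_start
    return df

df = add_datepart_mini(df, "Date")
```

At the end of the real pipeline, the big joined frame gets saved to disk in one go (with something like pandas' `to_feather`, which is discussed next).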
Something that those of you who use pandas may not be aware of is that there's a very cool new format called feather. You can save a pandas data frame into this feather format, and it pretty much takes it as it sits in RAM and dumps it to disk, so it's really, really, really fast. The reason you need to know this is that the Ecuadorian grocery competition that's on now has 350 million records, so you will care about how long things take. It took, I believe, about six seconds for me to save 350 million records to feather format. So that's pretty cool.

So at the end of that, I save it in feather format, and for the rest of this discussion I'm just going to take it as given that we've got this nicely preprocessed, feature-engineered file and I can just go read the feather file. But for you to play along at home, you will have to run those previous cells, except the ones I've commented out; you don't have to run those, because the file you download from files.fast.ai has already done that for you.

All right, so we basically have all these columns, and they're basically going to tell us how many of this thing was sold on this date at this store. The goal of this competition is to find out how many things will be sold for each store, for each type of thing, in the future. That's basically what we're going to be trying to do, and here's an example of what some of the data looks like.

Next week we're going to see how to go through these steps, but basically what we're going to learn is to split the columns into two types. Some columns we're going to treat as categorical, which is to say store ID 1 and store ID 2 are not numerically related to each other; they're categories. We're going to treat day of week like that too: Monday and Tuesday, day 0 and day 1, are not numerically related to each other. Whereas distance in kilometers to the nearest competitor, that's a number that we're going to treat
numerically. So, in other words, the categorical variables we're basically going to one-hot encode (you can think of it as one-hot encoding them), whereas the continuous variables we're going to feed into fully connected layers just as they are.

So what we'll be doing is basically creating a validation set, and a lot of this will start to look familiar: it's the same function we used on planet and dog breeds to create a validation set. There's some stuff you haven't seen before, where rather than saying `ImageClassifierData.from_csv` we're going to say `ColumnarModelData.from_data_frame`. You'll see the basic API concepts will be the same; they're just a little different. But just like before, we're going to get a learner, we're going to run `lr_find` to find our best learning rate, and then we're going to call `fit` with a metric and a cycle length. So the basic sequence is going to end up looking, hopefully, very familiar.

We're out of time, so what I suggest you do this week is try to enter as many Kaggle image competitions as possible. Try to really get a feel for cycle lengths, learning rates, and plotting things. That post about lesson one that I showed you at the start of class today: really go through that, with as many image datasets as you can, until you feel really comfortable with it. You want to get to the point where, next week when we start talking about structured data, this idea of how learners work, how data works, and data loaders and data sets and looking at pictures, is really intuitive. All right, good luck. See you next week.
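The categorical-versus-continuous split described above can be sketched in plain pandas. This is an illustrative sketch, not the fastai implementation, and the column names and variable lists are made up:

```python
import pandas as pd

# Hypothetical frame with two categorical columns and one continuous one
df = pd.DataFrame({
    "Store": [1, 2, 1],
    "DayOfWeek": [0, 1, 0],
    "CompetitionDistance": [570.0, 14130.0, 570.0],
})
cat_vars = ["Store", "DayOfWeek"]      # treated as categories, not numbers
contin_vars = ["CompetitionDistance"]  # fed in numerically, as-is

# Categorical columns become integer codes, which a one-hot encoding
# (or an embedding layer) can then consume
for c in cat_vars:
    df[c] = df[c].astype("category").cat.codes

# Continuous columns are commonly standardized before going into
# the fully connected layers
for c in contin_vars:
    df[c] = (df[c] - df[c].mean()) / df[c].std()
```

The key point is that store 1 and store 2 end up as distinct codes with no numeric relationship, while the competitor distance keeps its magnitude (just rescaled).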