All right, welcome to lesson six, where we're going to do a deep dive into computer vision, convolutional neural networks, and what a convolution is, and we're also going to learn the final regularization tricks, after learning last lesson about weight decay and/or L2 regularization. I want to start by showing you something that I'm really excited about, and that I've had a small hand in helping to create. For those of you who saw my talk on TED.com, you might have noticed this really interesting demo that we did about four years ago, showing a way to quickly build models with unlabeled data. It's been four years, but we're finally at a point where we're ready to put this out in the world and let people use it. And the first people we're going to let use it are you folks. So the company is called platform.ai, and the reason I'm mentioning it here is that it's going to let you create models on different types of data sets to what you can do now, that is to say, data sets that you don't have labels for yet. We're actually going to help you label them. This is the first time this has been shown, so I'm pretty thrilled about it. Let me give you a quick demo. If you go to platform.ai and choose Get Started, you'll be able to create a new project. When you create a new project, you can either upload your own images or start with one of the existing collections if you want to play around. If you upload your own, about 500 or so works pretty well. You can upload a few thousand, but to start, upload 500 or so, and they all have to be in a single folder. We're assuming that you've got a whole bunch of images that you haven't got any labels for. I've started with the cars collection, kind of going back to what we did four years ago. And so this is what happens when you first go into platform.ai and look at the collection of images you've uploaded: a random sample of them will appear on the screen.
And as you'll probably recognize, they are projected from a deep learning space into a 2D space using a pre-trained model. For this initial version, it's an ImageNet model we're using. As things move along, we'll be adding more and more pre-trained models. What I want to do is add labels to this data set representing which angle a photo of the car was taken from, which is something that ImageNet is actually going to be really bad at, isn't it? Because ImageNet has learned to recognize the difference between cars and bicycles, and ImageNet knows that the angle you take a photo from actually doesn't matter. So we want to try and create labels using the kind of thing that ImageNet specifically learned to ignore. For the projection that you see, we can click these layer buttons at the top to switch to using a projection from a different layer of the neural net. So here's the last layer, which is going to be a total waste of time for us, because it's really going to be projecting things based on what kind of thing it thinks it is. And the first layer is probably going to be a waste of time for us as well, because there's very little interesting semantic content there. But if we go into the middle, in layer three, we may well be able to find some differences there. Then you can click on the projection button here, and you can actually just press up and down, rather than pressing the arrows at the top, to switch between projections, or left and right to switch between layers. And you can basically look around until you notice that there's a projection which has kind of separated out the things you're interested in. In this one, I notice that it's got a whole bunch of cars that are kind of from the top front right over here. So if we zoom in a little bit, we can double check. It's like, yeah, that looks pretty good. They're all kind of front right.
So we can click on here to go to selection mode, and we can kind of grab a few. And then you should check. What we're doing here is trying to take advantage of the combination of human plus machine. The machine's pretty good at quickly doing calculations, but as a human, I'm pretty good at looking at a lot of things at once and seeing the odd one out. So in this case, I'm looking for cars that aren't front right, and by laying them all out in front of me, I can do that really quickly. It's like, okay, definitely that one. So just click on the ones that you don't want. All right, it's all good. So then you can just go back. And then you can either put them into a new category by typing to create a new label, or you can click on one of the existing ones. Before I came, I just created a few. So here's front right. So just click on it here. There we go, okay. And so that's the basic idea: you keep flicking through different layers or projections to try and find groups that represent the things you're interested in. And over time, you'll start to realize that there are some things that are a little bit harder. For example, I'm having trouble finding sides. So what I can do is, I can see over here there's a few sides, so I can zoom in here and click on a couple of them, like this one and this one and that one. And then I'll say find similar. This is going to look in that projection space, not just at the images that are currently displayed, but at all of the images that you uploaded. It's going through and checking all of them to see if any have projections in this space which are similar to the ones I've selected, and hopefully we'll find a few more of what I'm interested in, so I can label a few more side images at that point.
Okay, so now if I want to try to find a projection that separates the sides from the front right, I can click on each of those two, and then over here this button is now called switch to the projection that maximizes the distance between the labels. What this is going to do is try and find the best projection that separates out those classes. The goal here is to help me visually inspect and quickly find a bunch of things that I can use to label. So they're kind of the key features, and it's done a good job. You can see down here, we've now got a whole bunch of sides which I can now grab, because I was having a lot of trouble finding them before. And it's always worth double checking. It's kind of interesting to see how the neural nets behave. There seem to be more sports cars in this group than average as well, so it's kind of found side angles of sports cars. That's kind of interesting. So then I can click, all right, I've got those. So now I'll click side, and there we go. Once you've done that a few times, I find if you've got a hundred or so labels, you can then click on the train model button. It'll take a couple of minutes and come back and show you your trained model. And after it's trained, which I did on a smaller number of labels earlier, you can then switch this vary opacity button, and it'll actually kind of fade out the ones that are already predicted pretty well. It'll also give you an estimate of how accurate it thinks the model is. The main reason I mention this to you is so that you can now click the download button. It'll download the predictions, which is what we hope will be interesting to most people. But what I think will be interesting to you as deep learning students is that it will download your labels. So now you can use that labeled subset of data, along with the set that you haven't labeled yet, to see if you can build a better model than platform.ai has done for you.
See if you can use that initial set of data to kind of get going, creating models of stuff which you weren't able to label before. Clearly there are some things that this system is better at than others. For things that require really zooming in closely and taking a very, very close inspection, this isn't going to work very well. This is really designed for things that the human eye can pick up fairly readily. But we'd love to get feedback as well, and you can click on the help button to give feedback. Also, there's a platform.ai discussion topic in our forum. Arshak, if you can stand up, Arshak's the CEO of the company. He'll be there helping out, answering questions, and so forth. So yeah, I hope people find that useful. It's been many years getting to this point, and I'm glad we're finally there. Okay, so one of the reasons I wanted to mention this today is that we're going to be doing a big dive into convolutions later in this lesson. So I'm going to circle back to this to try and explain a little bit more about how that is working under the hood and give you a sense of what's going on. But before we do, we have to finish off last week's discussion of regularization. We were talking about regularization specifically in the context of the tabular learner. Because the tabular learner, this was the forward method. Sorry, this is the init method in the tabular learner. And our goal was to understand everything here, and we're not quite there yet. Last week, we were looking at the adult data set, which is a really simple, kind of over-simple data set that's just for toy purposes. So this week, let's look at a data set that's much more interesting, a Kaggle competition data set, so we know roughly what the best in the world is. And Kaggle competition results tend to be much harder to beat than academic state-of-the-art results tend to be, because a lot more people work on Kaggle competitions than on most academic data sets.
So it's a really good challenge to try and do well on a Kaggle competition data set. So this one, the Rossmann data set, they've got 3,000 drugstores in Europe, and you're trying to predict how many products they're going to sell in the next couple of weeks. One of the interesting things about this is that the test set is from a time period that is more recent than the training set. And this is really common, right? If you want to predict things, there's no point predicting things that are in the middle of your training set. You want to predict things in the future. Another interesting thing about it is that the evaluation metric they provided is the root mean squared percent error. So this is just a normal root mean squared error, except we go actual minus prediction, divided by actual. In other words, it's the percent error that we're taking the root mean squared of. So there's a couple of interesting features. It's always interesting to look at the leaderboard. So on the leaderboard, the winner was 0.1. The paper that we've roughly replicated was 0.105, 0.106. And 10th place out of 3,000 was 0.11-ish, a bit less. All right. So we're going to skip over a little bit, which is that the data provided here was a small number of files, but they also let competitors provide additional external data, as long as they shared it with all the competitors. And so in practice, the data set we're going to use contains, I can't remember, six or seven tables. The way that you join tables and stuff isn't really part of a deep learning course, so I'm going to skip over it. Instead, I'm going to refer you to Introduction to Machine Learning for Coders, which will take you step by step through the data preparation for this. We've provided it for you in the rossman_data_clean notebook, so you'll see the whole process there. And you'll need to run through that notebook to create these pickle files that we read here. Can you see this in the back? OK. Great.
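To make the metric concrete, here is a minimal sketch of root mean squared percent error as described above: actual minus prediction, divided by actual, then the root of the mean of the squares. The function name `rmspe` and the sales figures are made up for illustration, and the sketch assumes no actual value is zero.

```python
import numpy as np

def rmspe(actual, predicted):
    """Root mean squared percent error: sqrt(mean(((actual - predicted) / actual) ** 2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    pct_error = (actual - predicted) / actual   # percent error per row (assumes actual != 0)
    return float(np.sqrt(np.mean(pct_error ** 2)))

# Hypothetical sales figures, purely for illustration.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
score = rmspe(actual, predicted)
print(round(score, 4))
```

Note that being off by 10 on a store that sells 100 costs the same as being off by 20 on a store that sells 200, which is the point of a percent-based metric.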
I just want to mention one particularly interesting part of the rossman_data_clean notebook, which is that you'll see there's something that says add_datepart, and I wanted to explain what's going on here. I've been mentioning for a while that we're going to look at time series. And pretty much everybody who I've spoken to about it has assumed that I'm going to do some kind of recurrent neural network, but I'm not. Interestingly, the main academic group that studies time series is econometrics, but they tend to study one very specific kind of time series, which is where the only data you have is a sequence of time points of one thing. That's the only thing you have: one sequence. In real life, that's almost never the case. Normally we would have some information about the store that it represents or the people that it represents. We'd have metadata, we'd have sequences of other things measured at similar time periods or different time periods. And so most of the time I find in practice that the state-of-the-art results in competitions on more real-world data sets don't tend to use recurrent neural networks. Instead, they tend to take the time piece, which in this case was a date we were given in the data, and they add a whole bunch of metadata. So in our case, we were given a date, and we've added year, month, week of year, day of month, day of week, day of year, and then a bunch of booleans: is it the month start or end, quarter start or end, year start or end, plus elapsed time since 1970, and so forth. If you run this one function, add_datepart, and pass it a date, it'll add all of these columns to your data set for you. And so what that means is, let's take a very reasonable example. Purchasing behavior probably changes on payday. Payday might be the 15th of the month.
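As a rough picture of what that kind of date expansion produces, here is a small sketch written directly against pandas. It is not the library's add_datepart implementation; the helper name `add_date_features` and the column names are assumptions chosen for illustration, and it only covers some of the columns listed above.

```python
import pandas as pd

def add_date_features(df, col):
    """Expand a date column into several derived columns (illustrative sketch only)."""
    d = pd.to_datetime(df[col])
    out = df.copy()
    out["Year"] = d.dt.year
    out["Month"] = d.dt.month
    out["Day"] = d.dt.day                       # day of month, e.g. 15 for a mid-month payday
    out["Dayofweek"] = d.dt.dayofweek           # Monday is 0
    out["Dayofyear"] = d.dt.dayofyear
    out["Is_month_start"] = d.dt.is_month_start
    out["Is_month_end"] = d.dt.is_month_end
    # Seconds elapsed since 1970-01-01
    out["Elapsed"] = (d - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
    return out

df = pd.DataFrame({"Date": ["2015-07-15", "2015-07-31"]})
features = add_date_features(df, "Date")
print(features.T)
```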
So if you have a thing here called, this is day of month here, right, then it'll be able to recognize every time something is a 15 there and associate it with a higher, in this case, embedding matrix value. So basically, we can't expect a neural net to do all of our feature engineering for us. We can expect it to find nonlinearities and interactions and stuff like that, but for something like taking a date like this and figuring out that the 15th of the month is when interesting things happen, it's much better if we can provide that information for it. So this is a really useful function to use. And once you've done this, you can treat many kinds of time series problems as regular tabular problems. I say many kinds, not all. If there's very complex kind of state involved in a time series, such as equity trading or something like that, this probably won't be the case, or this won't be the only thing you need. But in this case, it'll get us a really good result, and in practice, most of the time, I find this works well. Tabular data is normally in pandas, so we just stored them as standard Python pickle files. We can read them in. We can take a look at the first five records. And the key thing here is that on a particular date, for a particular store ID, we want to predict the number of sales. Sales is the dependent variable. So the first thing I'm going to show you is something called preprocessors. You've already learned about transforms. Transforms are bits of code that run every time something is grabbed from a data set. So they're really good for data augmentation, which we'll learn about today, where it's going to get a different random value every time it's sampled. Preprocessors are like transforms, but they're a little bit different, which is that they run once before you do any training.
And really importantly, they run once on the training set, and then any kind of state or metadata that's created is shared with the validation and test sets. Let me give you an example. When we've been doing image recognition and we've had a set of classes for all the different pet breeds, and they've been turned into numbers, the thing that's actually doing that for us is a preprocessor that's being created in the background. That makes sure that the classes for the training set are the same as the classes for the validation set and the classes for the test set. So we're going to do something very similar here. For example, we can create a small subset of the data for playing with, which is a really good idea when you start with a new data set. So I've just grabbed 2,000 IDs at random, okay? And then I'm just going to grab a little training set and a little test set, half and half of those 2,000 IDs. I'm just going to grab five columns, okay? And then we can play around with this nice and easy. So here's the first few of those from the training set. As you can see, one of them is called PromoInterval, and it has these strings, and sometimes it's missing. In pandas, missing is NaN. So the first preprocessor I'll show you is Categorify. Categorify does basically the same thing that the classes thing for image recognition does for our dependent variable. It's going to take these strings, find all of the possible unique values, and create a list of them. And then it's going to turn the strings into numbers. So if I call it on my training set, that'll create categories there. And then I call it on my test set, passing in test=True. That makes sure it's going to use the same categories that I had before. And now when I say .head, it looks exactly the same. That's because pandas has turned this into a categorical variable, which internally is storing numbers but externally is showing me the strings.
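The idea of learning the categories on the training set and reusing them on the test set can be sketched in plain pandas. This is a sketch of the idea rather than the library's Categorify code, and the tiny data frames and their values are invented for illustration.

```python
import pandas as pd

train = pd.DataFrame({"PromoInterval": ["Jan,Apr,Jul,Oct", None, "Feb,May,Aug,Nov", None]})
test = pd.DataFrame({"PromoInterval": ["Feb,May,Aug,Nov", "Mar,Jun,Sept,Dec", None]})

# Learn the list of categories from the training set only.
train["PromoInterval"] = train["PromoInterval"].astype("category")
categories = train["PromoInterval"].cat.categories

# Reuse exactly those categories on the test set, so a given string
# always maps to the same number. Values never seen in training become missing.
test["PromoInterval"] = pd.Categorical(test["PromoInterval"], categories=categories)

train_codes = train["PromoInterval"].cat.codes.tolist()
test_codes = test["PromoInterval"].cat.codes.tolist()
print(list(categories), train_codes, test_codes)
```

The data frames still display the strings, but each column now stores integer codes underneath, with the same string getting the same code in both sets.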
But I can look inside PromoInterval at cat.categories. This is all standard pandas here. That shows me a list of what we would call classes in fastai, or what would just be called categories in pandas. And then if I look at cat.codes, you can see here, this list is the numbers that are actually stored: minus 1, minus 1, 1, minus 1, 1, right? What are these minus 1s? The minus 1s represent NaN. They represent missing. So pandas uses the special code minus 1 to mean missing. Now, as you know, these are going to end up in an embedding matrix, and we can't look up item minus 1 in an embedding matrix. So internally in fastai, we add 1 to all of these. Another useful preprocessor is FillMissing. Again, you can call it on the data frame, and you can call it on the test set, passing in test=True. For anything that has a missing value, this will create an additional column with the column name followed by underscore na. So CompetitionDistance_na. And it'll set it to true any time that value was missing. And then what we do is replace CompetitionDistance with the median for those. Why do we do this? Well, because very commonly the fact that something's missing is of itself interesting. It turns out the fact that this is missing helps you predict your outcome, right? So we certainly want to keep that information in a convenient boolean column so that our deep learning model can use it to predict things. But then we need CompetitionDistance to be a continuous variable, so we can use it in the continuous variable part of our model. So we can replace it with almost any number, right? Because if it turns out that the missingness is important, the model can use the interaction of CompetitionDistance_na and CompetitionDistance to make predictions. So that's what FillMissing does. You don't have to manually call preprocessors yourself.
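Here is a minimal pandas sketch of that fill-missing idea: record the missingness in a boolean column, then fill the gaps with the median learned from the training set. It is written by hand rather than taken from the library, and the distances are invented for illustration.

```python
import pandas as pd

train = pd.DataFrame({"CompetitionDistance": [100.0, None, 300.0, 2000.0]})
test = pd.DataFrame({"CompetitionDistance": [None, 500.0]})

col = "CompetitionDistance"
median = train[col].median()   # state learned once, from the training set only

for df in (train, test):
    df[col + "_na"] = df[col].isna()      # keep the fact that it was missing
    df[col] = df[col].fillna(median)      # then make the column fully numeric

print(train)
print(test)
```

The test set is filled with the training median, not its own, which is the same "run once on the training set and share the state" behaviour described for preprocessors.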
When you call any kind of item list creator, you can pass in a list of preprocessors, which you can create like this. Okay, so this is saying, I want to fill missing, I want to categorify, I want to normalize. So for continuous variables, it'll subtract the mean and divide by the standard deviation to help it train more easily. And so you just say those are my procs, and then you can just pass them in there, and that's it. And later on, you can go data.export and it'll save all the metadata for that data bunch. So you can later load it in knowing exactly what your category codes are, exactly what median values were used for replacing the missing values, and exactly what means and standard deviations you normalized by. Okay, so the main thing you have to do if you want to create a data bunch of tabular data is tell it what your categorical variables are and what your continuous variables are. And as we discussed briefly last week, your categorical variables are not just strings and things. I also include things like day of week and month and day of month. Even though they're numbers, I make them categorical variables, because, for example, day of month, I don't think it's going to have a nice smooth curve. I think that the 15th of the month and the first of the month and the 30th of the month are probably going to have different purchasing behavior to other days of the month. And so if I make it a categorical variable, it's going to end up creating an embedding matrix, and those different days of the month can get different behaviors. So you've actually got to think carefully about which things should be categorical variables. On the whole, if in doubt, and there are not too many levels in your category, which is called the cardinality, I would put it as a categorical variable. You can always try each and see which works best.
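The normalize step follows the same pattern as the other preprocessors: the mean and standard deviation come from the training set and are then reused everywhere else. A small sketch, with invented numbers and using pandas' default sample standard deviation (an assumption, the library may compute it differently):

```python
import pandas as pd

train = pd.DataFrame({"CompetitionDistance": [100.0, 300.0, 500.0]})
valid = pd.DataFrame({"CompetitionDistance": [300.0, 700.0]})

col = "CompetitionDistance"
mean, std = train[col].mean(), train[col].std()   # statistics from the training set only

train_norm = (train[col] - mean) / std
valid_norm = (valid[col] - mean) / std            # same mean and std, not the validation set's own

print(train_norm.tolist(), valid_norm.tolist())
```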
So our final data frame that we're going to pass in is going to be our training set with the categorical variables, the continuous variables, the dependent variable, and the date. The date we're just going to use to create a validation set, where we're basically going to say the validation set is going to be the same number of records, at the end of the time period, that the test set is for Kaggle. That way we should be able to validate our model nicely. Okay, so now we can create a tabular list. This is our standard data block API that you've seen a few times: from a data frame, pass in all of that information, split it into valid versus train, label it with a dependent variable. And here's something I don't think you've seen before, label_cls. This is our dependent variable, and as you can see, this is sales. It's not a float, it's an int64. If this was a float, then fastai would automatically know, or guess, that you want to do a regression. But this is not a float, it's an int, so fastai is going to assume you want to do a classification. So when we label it, we have to tell it that the class of the labels we want is a list of floats, not a list of categories, which would otherwise be the default. So this is the thing that's going to automatically turn this into a regression problem for us. And then we create a data bunch. I wanted to remind you again about doc, which is how we find out more information about this stuff. In this case, all of the labeling functions in the data block API will pass on any keywords they don't recognize to the label class. So one of the things I've passed in here is log, and that's actually going to end up in FloatList. If I go doc(FloatList), I can see a summary, and I can even jump into the full documentation. And it shows me here that log is something which, if true, is going to take the logarithm of my dependent variable. Why am I doing that?
So this is the thing that's actually going to automatically take the log of my y. The reason I'm doing that is because, as I mentioned before, the evaluation metric is root mean squared percentage error, and neither fastai nor PyTorch has a root mean squared percentage error loss function built in. I don't even know if such a loss function would work super well. But if you want to spend the time thinking about it, you'll notice that this ratio, if you first take the log of y and y hat, becomes a difference rather than a ratio. So in other words, if you take the log of y, then this becomes root mean squared error. So that's what we're going to do. We're going to take the log of y, and then we're just going to use root mean squared error, which is the default for a regression problem, so we won't even have to mention it. The reason that we have this here is because this is so common, right? Basically, any time you're trying to predict something that's like a population or a dollar amount of sales, these kinds of things tend to have long-tail distributions where you care more about percentage differences than exact, absolute differences. So you're very likely to want to do things with log=True and to measure the root mean squared percent error. We've learned about the y range before, which is going to use that sigmoid to help us get in the right range. Because this time the y values are going to have the log taken of them first, we need to make sure that the y range we want is also in log terms. So I'm going to take the maximum of the sales column. I'm going to multiply it by a little bit, because remember how we said it's nice if your range is a bit wider than the range of the data. And then we're going to take the log, and that's going to be our maximum. So then our y range will be from zero to a bit more than the maximum. So now we've got our data bunch. We can create a tabular learner from it. And then we have to pass in our architecture.
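Both points, the ratio turning into a difference and the y range being computed in log terms, can be checked with a few lines of NumPy. This is a sketch with invented sales numbers; the 1.2 multiplier and the helper `scale_to_range` are assumptions for illustration, not values or functions taken from the lesson. For small errors, log(y hat) minus log(y) is approximately the percent error, which is why root mean squared error on the logs stands in for the percent-based metric.

```python
import numpy as np

sales = np.array([5000.0, 8000.0, 40000.0])   # hypothetical Sales values
log_sales = np.log(sales)

# A prediction that is 10% too high for every row: a constant ratio...
pred = sales * 1.10
log_diff = np.log(pred) - np.log(sales)       # ...becomes a constant difference in log space

# Upper end of the y range: a bit above the largest sale, then take the log.
max_log_y = np.log(sales.max() * 1.2)         # 1.2 is an assumed "little bit" of headroom
y_range = (0.0, max_log_y)

def scale_to_range(raw, lo, hi):
    """Squash an unbounded model output into (lo, hi) with a sigmoid."""
    return lo + (hi - lo) / (1.0 + np.exp(-raw))

print(log_diff, y_range, scale_to_range(0.0, *y_range))
```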
And as we briefly discussed, for a tabular model our architecture is literally the most basic fully connected network, just like we showed in this picture. It's an input, matrix multiply, nonlinearity, matrix multiply, nonlinearity, matrix multiply, nonlinearity, done. One of the interesting things about this is that this competition is three years old, but I'm not aware of any significant advances, at least in terms of architecture, that would cause me to choose something different to what the third-placed folks did three years ago. We're still basically using simple, fully connected models for this problem. Now, the intermediate weight matrix is going to have to go from a 1,000 activation input to a 500 activation output, which means it's going to have 500,000 elements in that weight matrix. That's an awful lot for a data set with only a few hundred thousand rows. So this is going to overfit, and we need to make sure it doesn't. The way to make sure it doesn't is to use regularization, not to reduce the number of parameters. One way to do that is to use weight decay, which fastai will use automatically, and you can vary it to something other than the default if you wish. It turns out in this case we're going to want more regularization. And so we're going to pass in something called ps. This is going to provide dropout. And also this one here, emb_drop, is going to provide embedding dropout. So let's learn about what dropout is. The short version is, dropout is a kind of regularization. This is the dropout paper. Nitish, how do you say this? Srivastava. It was Srivastava's master's thesis under Geoffrey Hinton. And this picture from the original paper is a really good picture of what's going on. This first picture is a picture of a standard fully connected network. It's a picture of this. And what each line shows is a multiplication of an activation times a weight.
And then when you've got multiple arrows coming in, that represents a sum. So this activation here is the sum of all of these inputs times all of these weights. So that's what a normal fully connected neural net looks like. For dropout, we throw that away. At random, we throw away some percentage of the activations, not the weights, not the parameters. Remember, there are only two types of number in a neural net: parameters, also called weights, kind of, and activations. So we're going to throw away some activations. You can see that when we throw away this activation, all of the things that were connected to it are gone too. For each mini batch, we throw away a different subset of activations. How many do we throw away? We throw each one away with a probability P. A common value of P is 0.5. So what does that mean? You'll see in this case, not only have they deleted at random some of these hidden layer activations, but they've actually deleted some of the inputs as well. Deleting the inputs is pretty unusual. Normally, we only delete activations in the hidden layers. So what does this do? Well, every time I have a mini batch going through, I throw away some of the activations at random. And then for the next mini batch, I put them back and throw away some different ones. So it means that no one activation can kind of memorize some part of the input, because that's what happens if we overfit, right? If we overfit, some part of the model is basically learning to recognize a particular image, or a particular item, rather than a feature in general. With dropout, it's going to be very hard for it to do that. In fact, Geoffrey Hinton described part of the thinking behind this as follows. He said he noticed every time he went to his bank that all the tellers and staff moved around. And he realized the reason for this must be that they're trying to avoid fraud.
If they keep moving them around, nobody can specialize so much in the one thing they're doing that they can figure out a conspiracy to defraud the bank. Now, of course, it depends when you ask Hinton. At other times, he says that the reason for this was because he thought about how spiking neurons work. And he's a neuroscientist by training. There's a view that spiking neurons might help regularization, and dropout is kind of a way of matching this idea of spiking neurons. I mean, it's interesting. When you actually ask people where their idea for some algorithm came from, it basically never comes from math. It always comes from intuition and thinking about physical analogies and stuff like that. So anyway, the truth is a bunch of ideas, I guess, were all floating around, and they came up with this idea of dropout. But the important thing to know is it worked really, really well, right? And so we can use it in our models to get generalization for free. Now, too much dropout, of course, is reducing the capacity of your model, so it's going to underfit. And so you've got to play around with different dropout values for each of your layers to decide. In pretty much every fastai learner, there's a parameter called ps, which will be the P value for the dropout for each layer. So you can just pass in a list. Or you can pass in a single value, and it'll create a list with that value everywhere. Sometimes it's a little different. For a CNN, for example, if you pass in a single value, it will use that for the last layer and half that value for the earlier layers. We basically try to do things that represent best practice. But you can always pass in your own list to get exactly the dropout that you want. There is an interesting feature of dropout, which is that we talk about training time and test time. Test time we also call inference time. Training time is when we're actually doing those weight updates, the back propagation.
At training time, dropout works the way we just saw. At test time, we turn off dropout. We're not going to do dropout anymore, because we want it to be as accurate as possible. We're not training, so we can't cause it to overfit when we're doing inference. So we remove dropout. But what that means is, if previously P was 0.5, then half the activations were being removed, which means when they're all there, our overall activation level is now twice what it used to be. And so in the paper, they suggest multiplying all of your weights at test time by P (in the paper, P is the probability of keeping an activation, which is also 0.5 here). Interestingly, you can dig into the PyTorch source code and find the actual C code where dropout is implemented. And here it is. You can see what they're doing is something quite interesting. They first of all do a Bernoulli trial. So a Bernoulli trial is: with probability one minus P, return the value one, otherwise return the value zero. That's all it means. In this case, P is the probability of dropout, so one minus P is the probability that we keep the activation. So we end up here with either a one or a zero. And then, this is interesting, we divide in place, remember underscore means in place in PyTorch, we divide in place that one or zero by one minus P. If it's a zero, nothing happens, it's still zero. If it's a one and P was 0.5, that one now becomes two. And then finally, we multiply in place our input by this noise, this dropout mask. So in other words, in PyTorch we don't do the change at test time. We actually do the change at training time, which means that you don't have to do anything special at inference time with PyTorch. That's not just PyTorch, it's quite a common pattern.
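The three steps just described (Bernoulli trial, divide by one minus P, multiply the input by the mask) can be mirrored in a few lines. This sketch uses NumPy rather than PyTorch so it runs without a deep learning library, and the function name `dropout` and its arguments are choices made for illustration, not the PyTorch source.

```python
import numpy as np

def dropout(x, p, training, rng):
    """Zero each activation with probability p and rescale the survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x                                  # nothing special to do at inference time
    keep = 1.0 - p
    mask = rng.binomial(1, keep, size=x.shape).astype(x.dtype)  # Bernoulli trial: 1 with probability 1 - p
    mask /= keep                                  # a 1 becomes 1 / (1 - p); a 0 stays 0
    return x * mask                               # apply the dropout mask to the input

rng = np.random.default_rng(0)
x = np.ones((1000, 100))
train_out = dropout(x, p=0.5, training=True, rng=rng)
test_out = dropout(x, p=0.5, training=False, rng=rng)
print(np.unique(train_out), train_out.mean())
```

With P at 0.5, every surviving activation becomes 2.0, so the average activation level during training stays close to what the network will see at test time.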
But it's kind of nice to look inside the PyTorch source code and see, Dropout, this incredibly cool, incredibly valuable thing is really just these three lines of code, which they do in C because I guess it ends up a bit faster when it's all fused together, but lots of libraries do it in Python and that works well as well. You can even write your own Dropout layer and it should give exactly the same results as this. So that'd be a good exercise to try. See if you can create your own Dropout layer in Python and see if you can replicate the results that we get with this Dropout layer. So that's Dropout. And so in this case, we're gonna use a tiny bit of Dropout on the first layer and a little bit of Dropout on the next layer and then we're gonna use special Dropout on the embedding layer. Now, why do we use special Dropout on the embedding layer? So if you look inside the fast.ai source code, here's our tabular model. You'll see that in the section that checks that there's some embeddings, we call each embedding and then we concatenate the embeddings into a single matrix and then we call embedding Dropout. And embedding Dropout is simply just a Dropout, right? So it's just an instance of a Dropout module. This kind of makes sense, right? For continuous variables, that continuous variable is just in one column. You wouldn't wanna do Dropout on that because you're literally deleting the existence of that whole input, which is almost certainly not what you want. But for an embedding, an embedding is just effectively a matrix multiply by a one-hot encoded matrix. So it's just another layer. So it makes perfect sense to have Dropout on the output of the embedding because you're putting Dropout on those activations of that layer. And so you're basically saying, let's delete at random some of the results of that embedding, some of those activations. So that makes sense. 
The other reason we do it that way is because I did very extensive experiments about a year ago, where on this data set, I tried lots of different ways of doing kind of everything. And you can actually see it here. I put it all in a spreadsheet, of course, Microsoft Excel, put them into a pivot table to summarize them all together to find out kind of which different choices and hyperparameters and architectures work well and work less well. And then I created all these little graphs. And these are like little summary training graphs for different combinations of hyperparameters and architectures. And I found that there was one of them which ended up consistently getting a good predictive accuracy. The kind of bumpiness of the training was pretty low. And you can see it was just a nice, smooth curve. And so this is an example of the kind of experiments that I do that end up in the fast AI library. So embedding Dropout was one of those things that I just found work really well. And basically the results of these experiments is why it looks like this rather than something else. Well, it's a combination of these experiments. But then why did I do these particular experiments? Well, because it was very influenced by what worked well in that Kaggle Prize Winners paper. But there were quite a few parts of that paper I thought. There were some other choices they could have made. I wonder why they didn't. And I tried them out and found out what actually works and what doesn't work as well. And found a few little improvements. So that's the kind of experiments that you can play around with as well when you try different models and architectures, different Dropouts, layer numbers, number of activations, and so forth. So having created our learner, we can type learn.model to take a look at it. And as you would expect, in that there is a whole bunch of embeddings. Each of those embedding matrices tells you, well, this is the number of levels of the input for each input. 
And you can match these with your list cat_vars. So the first one will be store. So that's not surprising. There are 1,116 stores. And then the second number, of course, is the size of the embedding. And that's a number that you get to choose. And so fast.ai has some defaults, which actually work really, really well nearly all the time. So I almost never change them. But when you create your tabular learner, you can absolutely pass in an embedding size dictionary, which maps variable names to embedding sizes for anything where you want to override the defaults. And then we've got our embedding dropout layer. And then we've got a batch norm layer with 16 inputs. OK, the 16 inputs make sense because we have 16 continuous variables. The length of cont_names is 16. So this is something for our continuous variables. And specifically, it's over here, bn_cont on our continuous variables. And bn_cont is a BatchNorm1d. What's that? Well, the first short answer is, it's one of the things that I experimented with as to having batch norm or not in this. And I found that it worked really well. And then specifically what it is is extremely unclear. Let me describe it to you. It's kind of a bit of regularization. It's kind of a bit of training helper. It's called batch normalization. And it comes from this paper. Actually, before I do this, I just want to mention one other really funny thing about Dropout. I mentioned it was a master's thesis. Not only was it a master's thesis, one of the most influential papers of the last 10 years, it was rejected from the main neural nets conference, what was then called NIPS, now called NeurIPS. I think it's very interesting because it's just a reminder that our academic community is generally extremely poor at recognizing which things are going to turn out to be important. Generally, people are looking for stuff that is in the field that they're working on and understand. So Dropout kind of came out of left field.
It's kind of hard to understand what's going on. And so that's kind of interesting. And so it's a reminder that, as you develop beyond being just a practitioner into actually doing your own research, don't just focus on the stuff everybody's talking about. Focus on the stuff you think might be interesting. Because the stuff everybody's talking about generally turns out not to be very interesting. The community is very poor at recognizing high impact papers when they come out. Batch normalization, on the other hand, was immediately recognized as high impact. I definitely remember everybody talking about it in 2015 when it came out. And that was because it was so obvious. They showed this picture, showing the then state-of-the-art ImageNet model, Inception. This is how long it took them to get a pretty good result. And then they tried the same thing with this new thing called batch norm. And they just did it way, way, way more quickly. And so that was enough for pretty much everybody to go, wow, this is interesting. And specifically they said, this thing's called batch normalization and it's accelerating training by reducing internal covariate shift. So what is internal covariate shift? Well, it doesn't matter because this is one of those things where researchers came up with some intuition and some idea about this thing they wanted to try. They did it, it worked well. They then post hoc added on some mathematical analysis to try and claim why it worked. And it turned out they were totally wrong. So it took three years for people to really figure this out. In the last two months there have been two papers that have shown batch normalization doesn't reduce covariate shift at all. And even if it did, that has nothing to do with why it works.
So I think that's kind of an interesting insight again, which is why we should be focusing on being practitioners and experimentalists and developing an intuition. What batch norm does is what you see in this picture here, in this paper. Here are steps or batches, and here is loss. And here, the red line is what happens when you train without batch norm, very, very bumpy. And here the blue line is what happens when you train with batch norm, not very bumpy at all. What that means is you can increase your learning rate with batch norm, because these big bumps represent times that you're really at risk of your set of weights jumping off into some awful part of the weight space that it can never get out of again. So if it's less bumpy, then you can train at a higher learning rate. So that's actually what's going on. And here's what it is. This is the algorithm. And it's really simple. The algorithm is gonna take a mini batch, right? So we have a mini batch, and remember, this is a layer, so the thing coming into it is activations, okay? So it's a layer and it's gonna take in some activations. And the activations, it's calling x1, x2, x3, and so forth. The first thing we do is we find the mean of those activations. Sum divided by the count, that's just the mean. And the second thing we do is we find the variance of those activations. The difference from the mean, squared, then averaged, is the variance. And then we normalize. So the values minus the mean, divided by the standard deviation, is the normalized version. Okay, it turns out that bit's actually not that important. We used to think it was. Okay, but it turns out it's not. The really important bit is the next bit. We take those values and we add a vector of biases. They call it beta here. And we've seen that before. We've used a bias term before, okay? So we're just gonna add a bias term as per usual.
And then we're gonna use another thing that's a lot like a bias term, but rather than adding it, we're gonna multiply by it. So there's these parameters gamma and beta, which are learnable parameters. Remember in a neural net, there's only two kinds of number, activations and parameters. These are parameters, okay? They're things that are learnt with gradient descent. This is just a normal bias layer, beta. And this is a multiplicative bias layer. Nobody calls it that, but that's all it is, right? It's just like bias, but we multiply rather than add. That's all batch norm is. That's what the layer does. So why is that able to achieve this fantastic result? I'm not sure anybody has exactly written this down before. If they have, I apologize for failing to cite it because I haven't seen it, but let me explain what's actually going on here. The value of our predictions, Y hat, is some function of our various weights. There could be millions of them, weight one million. And it's also a function, of course, of the inputs to our layer. This function here is our neural net function, whatever is going on in our neural net. And then our loss, let's say it's mean squared error, is just our actuals minus our predicted squared, okay? So let's say we're trying to predict movie review outcomes, and they're between one and five, okay? And we've been trying to train our model, and the activations at the very end are currently between minus one and one. So they're way off where they need to be. The scale is off, the mean is off. So what can we do? One thing we could do would be to try and come up with a new set of weights that cause the spread to increase and cause the mean to increase as well. But that's gonna be really hard to do because remember all these weights interact in very intricate ways, right? We've got all those non-linearities and they all combine together. 
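The batch norm steps described above (mean, variance, normalize, then multiply by gamma and add beta) can be sketched for a single activation across one mini batch. This is a simplified illustration rather than the library code: `batch_norm` is a made-up name, and the small `eps` added for numerical stability comes from the paper's algorithm, not from the discussion here.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch norm for one activation over a mini batch (training-style, batch statistics only)."""
    n = len(xs)
    mean = sum(xs) / n                                # sum divided by the count
    var = sum((x - mean) ** 2 for x in xs) / n        # average squared difference from the mean
    normed = [(x - mean) / math.sqrt(var + eps) for x in xs]
    # gamma is the "multiplicative bias", beta is the ordinary additive bias;
    # in a real layer both are parameters learned by gradient descent.
    return [gamma * z + beta for z in normed]

batch = [1.0, 2.0, 3.0, 4.0]
plain = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=3.0)
print(plain)
print(shifted)
```

Changing `gamma` alone rescales the spread of the outputs and changing `beta` alone moves their mean, which is the direct control over scale and mean that the rest of the explanation relies on.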
So to kind of just move up, it's gonna require navigating through this complex landscape, and we use all these tricks like momentum and Adam and stuff like that to help us. But it still requires a lot of twiddling around to get there. So that's gonna take a long time and it's gonna be bumpy. But what if we did this? What if we multiplied by g and added b? We added two more parameter vectors. Well, now it's really easy, right? In order to increase the scale, that number has a direct gradient to increase the scale. To change the mean, that number has a direct gradient to change the mean. There are no interactions or complexities. It's just straight up and down, straight in and out. And that's what batch norm does, right? So batch norm is basically making it easier for it to do this really important thing, which is to shift the outputs up and down and in and out. And that's why we end up with these results. So those details in some ways don't matter terribly. The really important thing to know is you definitely wanna use it, right? Or if not it, something like it. There are various other types of normalization around nowadays. But batch norm works great. The other main normalization type we use in fast.ai is something called weight norm, which is a much more recent development, just in the last few months. Okay, so that's batch norm. And so what we do is we create a batch norm layer for the continuous variables. n_cont is the number of continuous variables. In fast.ai, n_ something always means the count of that thing. cont always means continuous. So then here is where we use it. We grab our continuous variables and we throw them through a batch norm layer. And so then over here, you can see it in our model. One interesting thing is this momentum here. This is not momentum like in optimization, but this is momentum as in exponentially weighted moving average.
Specifically, this mean and standard deviation, we don't actually use a different mean and standard deviation for every mini batch. If we did, it would vary so much that it'd be very hard to train. So instead, we take an exponentially weighted moving average of the mean and standard deviation. And if you don't remember what I mean by that, look back at last week's lesson to remind yourself about exponentially weighted moving averages, which we implemented in Excel for the momentum and Adam gradient squared terms. You can vary the amount of momentum in a batch norm layer by passing a different value to the constructor in PyTorch. If you use a smaller number, it means that the mean and standard deviation will vary less from mini batch to mini batch, and that will have less of a regularization effect. A larger number will mean the variation will be greater from mini batch to mini batch, and that will have more of a regularization effect. So as well as this thing of training more nicely because it's parameterized better, this momentum term in the mean and standard deviation is the thing that adds this nice regularization piece. When you add batch norm, you should also be able to use a higher learning rate. So that's our model. So then you can go lr_find, you can have a look, and then you can go fit, you can save it, you can plot the losses, you can fit a bit more, and we end up at 0.103. 10th place in the competition was 0.108, so it's looking good, all right? Again, take it with a slight grain of salt because what you actually need to do is use the real training set and submit it to Kaggle, but you can see we're very much amongst the kind of cutting edge of models, at least as of 2015, and as I say, there haven't really been any architectural improvements since then.
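The effect of the batch norm momentum value on the moving average can be sketched as follows. This assumes the convention PyTorch documents for the `momentum` argument of its batch norm layers, where the new running value is `(1 - momentum) * running + momentum * batch_value`; the helper names and the made-up sequence of batch means are purely illustrative.

```python
def update_running(running, batch_stat, momentum=0.1):
    """One exponentially weighted moving average step (PyTorch-style momentum convention)."""
    return (1.0 - momentum) * running + momentum * batch_stat

# Hypothetical per-mini-batch means that jump around a lot.
batch_means = [0.0, 10.0, 0.0, 10.0, 0.0, 10.0]

def track(momentum, start=5.0):
    """Return the running mean after each mini batch for a given momentum."""
    running, history = start, []
    for m in batch_means:
        running = update_running(running, m, momentum)
        history.append(running)
    return history

def spread(history):
    return max(history) - min(history)

small = track(momentum=0.01)
large = track(momentum=0.5)
print(spread(small), spread(large))
```

With the smaller momentum the running mean barely moves from mini batch to mini batch, while with the larger one it swings widely, which matches the "smaller number varies less" description above.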
There wasn't batch norm when this was around, so the fact we added batch norm means that we should get better results and certainly more quickly, and if I remember correctly, in their model they had to train at a lower learning rate for quite a lot longer. As you can see, this is less than 45 minutes of training. So that's nice and fast. Any questions? In what proportion would you use dropout versus other regularization approaches, like weight decay, L2 norms, et cetera? So remember that L2 regularization and weight decay are kind of two ways of doing the same thing, and we should always use the weight decay version, not the L2 regularization version. So there's weight decay, there's batch norm, which kind of has a regularizing effect. There's data augmentation, which we'll see soon, and there's dropout. So batch norm we pretty much always want, so that's easy. Data augmentation we'll see in a moment. So then it's really between dropout versus weight decay. I have no idea. I don't think I've seen anybody provide a compelling study of how to combine those two things. Can you always use one instead of the other? Why, or why not? I don't think anybody has figured that out. I think in practice, it seems that you generally want a bit of both. You pretty much always want some weight decay, but you often also want a bit of dropout. But honestly, I don't know why, and I've not seen anybody really explain why or how to decide. So this is one of these things you have to try out and kind of get a feel for what tends to work for your kinds of problems. I think the defaults that we provide in most of our learners should work pretty well in most situations, but yeah, definitely play around with it. Okay, the next kind of regularization we're gonna look at is data augmentation. And data augmentation is one of the least well studied types of regularization, but it's the kind that I think I'm kind of the most excited about.
The reason I'm kind of the most excited about it is that there's basically almost no cost to it. You can do data augmentation and get better generalization without it taking longer to train, without underfitting, to an extent, at least. So let me explain. So what we're gonna do now is we're gonna come back to computer vision and we're gonna come back to our pets data set again. So let's load it in, our pets data set, where the images are inside the images subfolder. I'm gonna call get_transforms as per usual, but when we call get_transforms, there's a whole long list of things that we can provide. And so far we haven't been varying that much at all, but in order to really understand data augmentation, I'm gonna kind of ratchet up all of the defaults. So there's a parameter here for what's the probability of an affine transform happening? What's the probability of a lighting transform happening? So I set them both to one. So they're all gonna get transformed. I'm gonna do more rotation, more zoom, more lighting transforms and more warping. What do all those mean? Well, you should check the documentation and you do that by typing doc. And there's the brief documentation, but the real documentation is in docs. So I'll click on show in docs and here it is. And so this tells you what all of those do, but generally the most interesting parts of the docs tend to be at the top where you kind of get the summaries of what's going on. And so here there's something called list of transforms. And here you can see every transform has something showing you lots of different values of it, right? So here's brightness. So make sure you read these, and remember these are notebooks you can open up, so you can run this code yourself and get this output. All of these HTML documentation documents are auto-generated from the notebooks in the docs_src directory in the fast.ai repo, right? So you will see the exact same cats if you try this.
Sylvain really likes cats, so there's a lot of cats in the documentation. And I think, you know, because he's been so awesome at creating great documentation, he gets to pick the cats. So for example, looking at different values of brightness, what I do here is I look to see two things. The first is for which of these levels of transformation is it still clear what the picture is a picture of. So this is kind of getting to a point where it's pretty unclear. This is possibly getting a little unclear. The second thing I do is I look at the actual data set that I'm modeling, or particularly the data set that I'll be using as a validation set. And I try to get a sense of what the variation, in this case in lighting, is. So if they're nearly all professionally taken photos, I would probably want them all to be about in the middle. But if they're the kind of photos that are taken by some pretty amateur photographers, there are likely to be some that are very overexposed, some very underexposed, right? So you should pick a value of this data augmentation for brightness that both allows the image to still be seen clearly and also represents the kind of data that you're gonna be using this to model on in practice. So it's kind of the same thing for contrast, right? It'd be unusual to have a data set with such ridiculous contrast. But perhaps you do, in which case you should use data augmentation up to that level. But if you don't, then you shouldn't. This one called dihedral is just one that does every possible rotation and flip. And so obviously most of your pictures are not gonna be upside down cats, right? So you probably would say, hey, this doesn't make sense. I won't use this for this data set. But if you were looking at satellite images, of course you would. On the other hand, flip makes perfect sense. So you would include that. A lot of things that you can do with fast.ai let you pick a padding mode. And this is what padding mode looks like.
You can pick zeros, you can pick border, which just replicates, or you can pick reflection, which as you can see is as if the last few pixels are in a mirror. Reflection's nearly always better, by the way. I don't know that anybody else has really studied this, but we have studied it in some depth. We haven't actually written a paper about it, but just enough for our own purposes to say reflection works best most of the time. So that's the default. Then there's a really cool bunch of perspective warping ones, which I'll show you by using symmetric warp. We've added black borders to this, so it's more obvious what's going on. And as you can see, what symmetric warp is doing, it's as if the camera is being moved above or to the side of the object and literally warping the whole thing like that, right? And so the cool thing is that as you can see, each of these pictures, it's as if this cat was being photographed from different angles, right? So they're all kind of optically sensible, right? And so this is a really great type of data augmentation. It's also one which I don't know of any other library that does it, or at least certainly one that does it in a way that's both fast and keeps the image crisp as it is in fast.ai. So if you're looking to win a Kaggle competition, this is the kind of thing that's gonna get you above the people that aren't using the fast.ai library. So having looked at all that, we are going to have a little get data function that just does the usual data block stuff, but we're gonna add padding mode explicitly so that we can turn on padding mode of zeros just so we can see what's going on better. fast.ai has this handy little function called plot_multi, which is gonna create a three by three grid of plots, and each one will contain the result of calling this function, which will receive the plot coordinates and the axis.
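The three padding modes mentioned above (zeros, border, reflection) can be illustrated on a single row of pixels. This is a toy sketch, not the fast.ai implementation: `pad1d` is a made-up helper, and the reflection variant shown here mirrors around the edge pixel without repeating it, which is an assumption about the exact convention.

```python
def pad1d(row, n, mode="reflection"):
    """Pad a list of pixel values with n extra values on each side."""
    if n >= len(row):
        raise ValueError("n must be smaller than the row length")
    if mode == "zeros":
        left, right = [0] * n, [0] * n
    elif mode == "border":
        # Replicate the edge pixel.
        left, right = [row[0]] * n, [row[-1]] * n
    elif mode == "reflection":
        # Mirror the pixels next to the edge, as if the edge were a mirror.
        left = row[n:0:-1]
        right = row[-2:-n - 2:-1]
    else:
        raise ValueError(f"unknown padding mode: {mode}")
    return left + row + right

row = [1, 2, 3, 4]
for mode in ("zeros", "border", "reflection"):
    print(mode, pad1d(row, 2, mode))
```

Reflection keeps the padded values looking like a continuation of the image content, which is the intuition behind it producing more natural-looking augmented images than black (zero) borders.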
And so I'm actually gonna plot the exact same thing in every box, but because this is a training data set, it's gonna use data augmentation. And so you can see the same doggie using lots of different kinds of data augmentation. And so you can see why this is gonna work really well because these pictures all look pretty different, right? But we didn't have to do any extra hand labeling or anything, they're like, that's like free extra data. Okay, so data augmentation is really, really great. And one of the big opportunities for research is to figure out ways to do data augmentation in other domains. So how can you do data augmentation with text data or genomic data or histopathology data or whatever, right? Almost nobody's looking at that. And to me it's one of the biggest opportunities that could let you decrease data requirements by like five to 10x. So here's the same thing again, but with reflection padding instead of zero padding. And you can kind of see, like see this doggie's legs are actually being reflected at the bottom here. So reflection padding tends to create images that are kind of much more naturally reasonable, like in the real world you don't get black borders like this, so they do seem to work better. Okay, so because we're gonna study convolutional neural networks, we are gonna create a convolutional neural network. You know how to create them. So I'll go ahead and create one. I will fit it for a little bit. I will unfreeze it. I will then create a larger version of the dataset, 352 by 352 and fit for a little bit more. And I will save it. Okay, so we have a CNN and we're gonna try and figure out what's going on in our CNN. And the way we're gonna try and figure it out is specifically that we're gonna try to learn how to create this picture. This is a heat map, right? This is a picture which shows me what part of the image did the CNN focus on when it was trying to decide what this picture is. So we're gonna make this heat map from scratch. 
When we, so we're kind of at a point now in the course where I'm assuming that if you've got to this point, you know when you're still here, thank you, then you're interested enough that you're prepared to kind of dig into some of these details. So we're actually gonna learn how to create this heat map without almost any fast AI stuff. We're gonna use pure kind of tensor arithmetic in PyTorch and we're gonna try and use that to really understand what's going on. So to warn you, none of it's rocket science, but a lot of it's gonna look really new. So don't expect to get it the first time, but expect to like listen, jump into the notebook, try a few things, test things out, look particularly at like tensor shapes and inputs and outputs to check your understanding, then go back and listen again, right? And kind of try it a few times because you will get there, right? It's just that there's gonna be a lot of new concepts because we haven't done that much stuff in pure PyTorch. Okay, so what we're gonna do is we're gonna have a seven minute break and then we're gonna come back and we're gonna learn all about the innards of a CNN. So I'll see you at 750. So let's learn about convolutional neural networks. You know, the funny thing is it's pretty unusual to get close to the end of a course and only then look at convolutions, but like when you think about it, knowing actually how batch norm works or how dropout works or how convolutions work isn't nearly as important as knowing how it all goes together and what to do with them and how to figure out how to do those things better. But it's, you know, we're kind of at a point now where we wanna be able to do things like that. And although, you know, we're adding this functionality directly into the library so you can kind of run a function to do that, you know, the more you do, the more you'll find things that you wanna do a little bit differently to how we do them. 
Or there'll be something in your domain where you think, oh, I could do a slight variation of that. So you're kind of getting to a point in your experience now where it helps to know how to do more stuff yourself. And that means you need to understand what's really going on behind the scenes. So what's really going on behind the scenes is that we are creating a neural network that looks a lot like this, right? But rather than doing a matrix multiply here and here and here, we're actually going to do instead a convolution. And a convolution is just a kind of matrix multiply which has some interesting properties. You should definitely check out this website, setosa.io/ev, Explained Visually, where we have stolen this beautiful animation from. It's actually a JavaScript thing that you can play around with yourself in order to see how convolutions work. And it's actually showing you a convolution as we move around these little red squares. So here's a picture, a black and white or grayscale picture, right? And as this red thing moves around, it shows you a different three by three part of this picture, right? It shows you over here the values of the pixels, right? So in fast.ai's case, our pixel values are between naught and one; in this case, they're between naught and 255, right? So here are nine pixel values. This area is pretty white, so they're pretty high numbers. Okay? And so as we move around, you can see the nine big numbers change and you can also see their colors change. Up here, there's another nine numbers. And you can see those in the little times one, times two, times one, here we are, one, two, one. And what you might see going on is as we move this little red block, as these numbers change, we then multiply them by the corresponding numbers up here. And so let's start using some nomenclature. The thing up here, we are gonna call the kernel, the convolutional kernel.
So we're gonna take each little three by three part of this image and we're gonna do an element-wise multiplication of each of the nine pixels that we're mousing over with each of the nine items in our kernel. And so once we multiply each set together, we can then add them all up. And that is what's shown on the right. As the little bunch of red things moves over there, you can see there's one red thing that appears over here. The reason there's one red thing over here is because each set of nine, after going through the element-wise multiplication with the kernel, gets added together to create one output. So therefore, the size of this image has one pixel less on each edge than the original, as you can see. See how there's a black border on it? That's because at the edge, the three by three kernel can't quite go any further, right? So the furthest you can go is to end up with a dot in the middle, just off the corner, right? So why are we doing this? Well, perhaps you can see what's happened. This face has turned into some white parts outlining the horizontal edges. How? Well, the how is just by doing this element-wise multiplication of each set of nine pixels with this kernel, adding them together and sticking the result in the corresponding spot over here. Why is that creating white spots where the horizontal edges are? Well, let's think about it. Let's look up here. So if we're just in this little bit here, right, then the spots above it are all pretty white, so they have high numbers. So the bits above it, big numbers, they're getting multiplied by one, two, one. So that's going to create a big number. And the ones in the middle are all multiplied by zero, so we don't care about those. And then the ones underneath are all small numbers because they're all close to zero, so that really doesn't do much at all. So therefore, that little set there is going to end up bright white.
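The sliding multiply-and-add just described can be written as a few nested loops. This is a plain Python sketch on a tiny made-up image rather than a library convolution; the kernel uses the one, two, one top row and zero middle row from the demo, with the bottom row assumed to be the negated top row.

```python
def conv2d(image, kernel):
    """Slide the kernel over the image; each position yields one output value (no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    # Element-wise multiply, then add everything up.
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        out.append(row)
    return out

top_edge_kernel = [[ 1,  2,  1],
                   [ 0,  0,  0],
                   [-1, -2, -1]]

# Light rows (255) on top of dark rows (0), then a light row again at the bottom.
image = [[255] * 5,
         [255] * 5,
         [0] * 5,
         [0] * 5,
         [255] * 5]

for row in conv2d(image, top_edge_kernel):
    print(row)
# [1020, 1020, 1020]
# [1020, 1020, 1020]
# [-1020, -1020, -1020]
```

The 5 by 5 input gives a 3 by 3 output, one pixel less on each edge. Positions with light pixels above dark ones come out strongly positive, and positions with dark above light come out strongly negative.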
On the other side, right, down here, you've got light pixels underneath, so they're going to get a lot of negative. Dark pixels on top, which are very small, so not much happens. So therefore, over here, we're going to end up with very negative. So this thing where we take each three-by-three area and element-wise multiply it with a kernel and add each of those up together to create one output is called a convolution. That's it, that's a convolution. So that might look familiar to you, right, because what we did a while ago is we looked at that Zeiler and Fergus paper where we saw each different layer and we visualized what the weights were doing. And do you remember how the first layer was basically finding diagonal edges and gradients? That's because that's all a convolution can do, right? Each of our layers is just a convolution. So the first layer can do nothing more than this kind of thing. But the nice thing is the next layer could then take the results of this, right? And it could kind of combine channels. The output of one convolutional filter is called a channel, right? So it could take one channel that finds top edges and another channel that finds left edges, and then the layer above that could take those two as input and create something that finds top left corners, as we saw when we looked at those Zeiler and Fergus visualizations. So let's take a look at this from another angle, or quite a few other angles. And we're gonna look at a fantastic post from a guy called Matt Kleinsmith, who was actually a student in the first year that we did this course. And he wrote this as part of his project work back then. And what he's gonna show here is, here is our image. It's a three by three image and our kernel is a two by two kernel. And what we're gonna do is we're gonna apply this kernel to the top left two by two part of this image.
And so the pink bit will be correspondingly multiplied by the pink bit, the green by the green, and so forth. And they all get added up together to create this top left in the output. So in other words, P equals alpha times A, plus beta times B, plus gamma times D, plus delta times E. There it is. Plus b, which is a bias. Okay, that's fine. That's just a normal bias. So you can see how basically each of these output pixels is the result of some different linear equation. That makes sense. And you can see these same four weights are being moved around because this is our convolutional kernel. Here's another way of looking at it from Matt, which is here is a classic neural network view. And so P now is a result of multiplying every one of these inputs by a weight and then adding them all together, except the gray ones are gonna have a value of zero, right? Because remember P was only connected to A, B, D, and E. So in other words, remembering that this picture represents a matrix multiplication, we can represent the convolution as a matrix multiplication. So here is our list of pixels in our three by three image flattened out into a vector. And here is a matrix vector multiplication plus bias. And then a whole bunch of them we're just gonna set to zero, right? So you can see here we've got a zero, zero, zero, zero, zero, which corresponds to the gray connections. So in other words, a convolution is just a matrix multiplication where two things happen. Some of the entries are set to zero all the time, and all of the ones with the same color always have the same weight. So when you've got multiple things with the same weight, that's called weight tying. So clearly we could implement a convolution using matrix multiplication, but we don't because it's slow. So in practice, our libraries have specific convolution functions that we use. And they're basically doing this, which is this equation, which is the same as this matrix multiplication.
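To make the "a convolution is a matrix multiplication with zeros and tied weights" point concrete, here is a minimal pure-Python sketch. The pixel values, the kernel values and the helper names `conv2d_direct` and `conv_as_matrix` are made up for illustration; they are not from Matt's post or from any library.

```python
# A 3x3 image (A B C / D E F / G H I) and a 2x2 kernel (alpha beta / gamma delta).
# All numbers here are arbitrary illustrative values.
image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0],
         [7.0, 8.0, 9.0]]
kernel = [[0.5, -1.0],
          [2.0, 0.25]]
bias = 0.1


def conv2d_direct(img, k, b):
    """Slide the kernel over the image: element-wise multiply, sum, add bias."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[r + i][c + j] * k[i][j]
                 for i in range(kh) for j in range(kw)) + b
             for c in range(out_w)]
            for r in range(out_h)]


def conv_as_matrix(k, in_h, in_w):
    """Build the (outputs x inputs) matrix: mostly zeros, same four weights reused."""
    kh, kw = len(k), len(k[0])
    out_h, out_w = in_h - kh + 1, in_w - kw + 1
    rows = []
    for r in range(out_h):
        for c in range(out_w):
            row = [0.0] * (in_h * in_w)
            for i in range(kh):
                for j in range(kw):
                    row[(r + i) * in_w + (c + j)] = k[i][j]
            rows.append(row)
    return rows


flat = [p for row in image for p in row]          # image flattened to length 9
W = conv_as_matrix(kernel, 3, 3)                  # 4 outputs x 9 inputs
via_matmul = [sum(w * x for w, x in zip(row, flat)) + bias for row in W]
direct = [v for row in conv2d_direct(image, kernel, bias) for v in row]

print(W[0])  # weights for P: non-zero only at positions A, B, D, E
print(all(abs(a - b) < 1e-9 for a, b in zip(via_matmul, direct)))
```

Each row of `W` holds the same four kernel values, just shifted to different positions, which is the weight tying described above.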
And as we discussed, we have to think about padding because if you have a three by three kernel and a three by three image, then that can only create one pixel of output. There's only one place that this three by three can go. So if we want to create more than one pixel of output, we have to do something called padding, which is to put additional numbers all around the outside. So what most libraries do is that they just put a bunch of zeros all around the outside. So for a three by three kernel, a single zero on every edge piece here. And so once you've padded it like that, you can now move your three by three kernel all the way across and get the same output size that you started with. Okay? Now, as we mentioned, in fastai we don't necessarily use zero padding. Where possible, we use reflection padding, although for these simple convolutions, we often use zero padding because in a big image it doesn't make too much difference. Okay, so that's what a convolution is. So a convolutional neural network wouldn't be very interesting if it could only create top edges. So we have to take it a little bit further. So if we have an input and it might be, you know, standard kind of red, green, blue, then we can create a kernel, a three by three kernel, like so. And then we could pass that kernel over all of the different pixels. But if you think about it, we actually don't have a 2D input anymore. We have a 3D input, a rank three tensor. So we probably don't want to use the same kernel values for each of red and green and blue. Because for example, if we're creating a green frog detector, we would want more activations on the green than we would on the blue, right? Or if we're trying to find something that can actually find a gradient that goes from green to blue, then the kernels for each channel need to have different values in them.
So therefore, we need to create a three by three by three kernel. Okay, so this is still our kernel. And we're still going to vary it across the height and the width. But rather than doing an element-wise multiplication of nine things, we're going to do an element-wise multiplication of 27 things. Three by three by three. And we're still going to then add them up into a single number. So as we pass this cube over this, and the little bit that's going to be sitting behind it, right? As we do that part of the convolution, it's still going to create just one number, as we do an element-wise multiplication of all 27 and add them all together. So we can do that across the whole single unit padded input. And so we started with one, two, three, four, five by five. So we're going to end up with an output that's also five by five. Right? But now, our input was three channels and our output is only one channel. Now, we're not going to be able to do very much with just one channel, because all we've done now is found a top edge. How are we going to find a side edge and a gradient and an area of constant white? Well, we're going to have to create another kernel. And we're going to have to convolve that over the input. And that's going to create another five by five. And then we can just stack those together across this as another axis. And we can do that lots and lots of times. And that's going to give us another rank three tensor output. So that's what happens in practice. Right? In practice, we start with an input which is h by w by, for images, three. We pass it through a bunch of convolutional kernels. And we can pick how many we want. And it gives us back an output of height by width by however many kernels we had. And so often that might be something like 16 in the first layer. And so now we've got 16 channels. They're called 16 channels.
Representing things like how much left edge was on this pixel, how much top edge was in this pixel, how much blue to red gradient was on this... Well, on this set of 27, or 9 pixels each with RGB. And so then you can just do the same thing, right? You can have another bunch of kernels. And that's going to create another output, rank three tensor. Again, height by width by whatever. Might still be 16. Now what we really like to do is as we get deeper in the network, we actually want to have more and more channels. We want to be able to find a richer and richer set of features. So that, as we saw in the Zeiler and Fergus paper, by layer four or five we've kind of got eyeball detectors and fur detectors and things, right? So you really need a lot of channels. So in order to avoid our memory going out of control, from time to time we create a convolution where we don't step over every single set of three by three, but instead we skip over two at a time. So we would start with a three by three centered at (2, 2), and then we'd jump over to (2, 4), (2, 6), (2, 8), and so forth. And that's called a stride two convolution. And so what that does is it looks exactly the same, right? It's still just a bunch of kernels, but we're just jumping over two at a time, right? We're skipping every alternate input pixel. And so the output from that will be h over 2 by w over 2. And so when we do that, we generally create twice as many kernels. So we can now have say 32 activations in each of those spots. And so that's what modern convolutional neural networks kind of tend to look like, right? And so we can actually see that if we go into our pets and we grab our CNN, right? And we're going to take a look at this particular cat. So if we go x, y equals data.valid_ds at some index, so let's just grab the 0th. We'll go .show and we'll print out the value of y. Apparently, this cat is of category Maine Coon.
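The shapes described over the last few paragraphs (a three by three by three kernel, many kernels stacked into channels, zero padding, and a stride two convolution) can be checked with a small PyTorch sketch. The tensors here are random stand-ins, not the weights of any real model, and the 8 by 8 grid for the stride two case is just an assumed example size.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# One RGB image, 5x5, as a mini-batch of size 1: (batch, channels, height, width).
x = torch.randn(1, 3, 5, 5)

# 16 kernels, each 3 (input channels) x 3 x 3, so each one multiplies 27 numbers.
k16 = torch.randn(16, 3, 3, 3)

# Zero padding of 1 keeps the 5x5 grid; the 16 results are stacked as channels.
same = F.conv2d(x, k16, padding=1)
print(same.shape)    # torch.Size([1, 16, 5, 5])

# One output value is just the sum of 27 element-wise products.
manual = (x[0, :, 1:4, 1:4] * k16[0]).sum()

# A stride 2 convolution halves the grid; here we also double the channels.
k32 = torch.randn(32, 16, 3, 3)
halved = F.conv2d(torch.randn(1, 16, 8, 8), k32, stride=2, padding=1)
print(halved.shape)  # torch.Size([1, 32, 4, 4])
```

Here `manual` reproduces the centre pixel of the first output channel, `same[0, 0, 2, 2]`, by hand.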
So until a week ago, I was not at all aware that there's a cat called a Maine Coon. Having spent all week with this particular cat, I am now deeply familiar with this Maine Coon. So if we go learn.summary, remember that our input we asked for was 352 by 352 pixels. Generally speaking, the very first convolution tends to have a stride of two. So after the first layer, it's 176 by 176. So learn.summary will print out for you the output shape after every layer. 176 by 176. And the first set of convolutions has 64 activations. And we can actually see that if we print out the model, you can see here it's a 2D conv with three input channels and 64 output channels and a stride of two. And interestingly, it actually starts with a kernel size of 7 by 7. So nearly all of the convolutions are 3 by 3. See, they're all 3 by 3. For reasons we'll talk about in part two, we often use a larger kernel for the very first one. If you use a larger kernel, you have to use more padding. So we have to use kernel size integer-divided by two as the padding to make sure we don't lose anything. Anyway, so we now have 64 output channels and since it was stride two, it's now 176 by 176. And then as we go along, you'll see that from time to time we go from 88 by 88 to 44 by 44, the grid size. So that was a stride two conv. And then when we do that, we generally double the number of channels. So we keep going through a few more convs. And as you can see, they've got batch norm and ReLU. That's kind of pretty standard. And eventually we do it again. Another stride two conv, which again doubles the channels. We've now got 512 by 11 by 11. And that's basically where we finish the main part of the network. We end up with 512 channels, 11 by 11. Okay, so we're actually at a point where we're going to be able to do this heat map now. So let's try and work through it. Before we do, I want to show you how you can do your own manual convolutions, because it's kind of fun.
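The grid sizes printed by learn.summary (352, 176, 88, 44, 22, 11) follow from the usual output-size arithmetic, out = floor((n + 2 × padding − kernel) / stride) + 1. The sketch below works through it in plain Python. The stage list is an assumption based on the standard ResNet-34 layout, where the step from 176 to 88 is a stride two max pool rather than a convolution; the helper name `conv_out_size` is made up for this example.

```python
def conv_out_size(n, kernel, stride, padding):
    """Output size along one axis for a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1


# (description, kernel size, stride, padding) for each downsampling step.
# Padding is kernel_size // 2 in every case, as described above.
stages = [
    ("7x7 stride 2 conv", 7, 2, 7 // 2),
    ("3x3 stride 2 max pool", 3, 2, 3 // 2),
    ("3x3 stride 2 conv", 3, 2, 3 // 2),
    ("3x3 stride 2 conv", 3, 2, 3 // 2),
    ("3x3 stride 2 conv", 3, 2, 3 // 2),
]

sizes = [352]
for name, kernel, stride, padding in stages:
    sizes.append(conv_out_size(sizes[-1], kernel, stride, padding))
    print(f"{name}: {sizes[-2]} -> {sizes[-1]}")

# With stride 1 and padding kernel // 2, the grid size is unchanged.
print(conv_out_size(88, kernel=3, stride=1, padding=1))  # 88
```

This is why a 7 by 7 kernel needs a padding of three: anything less and the first layer would come out smaller than 176 by 176.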
So we're going to start with this picture of a Maine Coon and I've created a convolutional kernel. And so as you can see, this one has a right edge and a bottom edge with positive numbers. And just inside that, it's got negative numbers. So I'm thinking this should show me bottom right edges. Okay, so that's my tensor. Now, one complexity is that that 3 by 3 kernel cannot be used for this purpose because I need two more dimensions. The first is I need the third dimension to say how to combine the red, green, and blue. So what I do is I say .expand. This is my 3 by 3. And I pop another 3 on the start. What .expand does is it says create a 3 by 3 by 3 tensor by simply copying this one three times. I mean, honestly, it doesn't actually copy it. It pretends to have copied it, you know, but it just basically refers to the same block of memory. So it kind of copies it in a memory efficient way. So this one here is now three copies of that. And the reason for that is that I want to treat red and green and blue the same way for this little manual kernel I'm showing you. And then we need one more axis because rather than actually having separate kernels, like I kind of drew these as if they were multiple kernels, what we actually do is we use a rank 4 tensor. And so the very first axis is for every separate kernel that we have. So in this case, I'm just going to create one kernel. So to do a convolution, I still have to put this unit axis on the front. So you can see k.shape is now 1, 3, 3, 3. So it's a 3 by 3 kernel. There are three of them. And then that's just the one kernel that I have. So it kind of takes a while to get the feel for these higher dimensional tensors because we're not used to writing out a 4D tensor. But just think of them like this. A 4D tensor is just a bunch of 3D tensors sitting on top of each other. So this is our 4D tensor. And then you can just call conv2d, passing in some image.
And so the image I'm going to use is the first item of my validation dataset, and the kernel. There's one more trick, which is that in PyTorch, pretty much everything is expecting to work on a mini-batch, not on an individual thing. So in our case, we have to create a mini-batch of size 1. So our original image is 3 channels by 352 by 352, height by width. Remember PyTorch is channel by height by width. So I need to create a rank 4 tensor where the first axis is 1. In other words, it's a mini-batch of size 1 because that's what PyTorch expects. So there's something you can do in both PyTorch and NumPy, which is you can index into an array or a tensor with a special value, None. And that creates a new unit axis at that point. So t is my image of dimensions 3 by 352 by 352. t[None] is a rank 4 tensor, a mini-batch of 1 image, of 1 by 3 by 352 by 352. Now I can go conv2d and get back my cat, specifically my Maine Coon. So that's how you can play around with convolutions yourself. So how are we going to do this to create a heat map? This is where things get fun. Remember what I mentioned was that I basically have my input, red-green-blue, and it goes through a bunch of convolutional layers. I'll just write a little line to say a convolutional layer, to create activations which have more and more channels and eventually smaller and smaller height by width. Until eventually, remember we looked at the summary, we ended up with something which was 11 by 11 by 512. There's a whole bunch more layers that we skipped over. Now there are 37 classes, because remember data.c is the number of classes we have, and we can see that at the end here, we end up with 37 features in our model. So that means that we end up with a probability for every one of the 37 breeds of cat and dog. So it's a vector of length 37.
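The manual convolution walked through above can be sketched in a few lines of PyTorch. The kernel values below are only illustrative (positive along the right and bottom edges, negative just inside, chosen so they sum to zero), and a random tensor stands in for the actual cat image, so treat this as an assumption-laden sketch rather than the notebook's exact code.

```python
import torch
import torch.nn.functional as F

# A 3x3 "bottom right edge" kernel: positive on the right and bottom edges,
# negative just inside. The values are illustrative and sum to zero.
base = torch.tensor([[0.0, -5 / 3, 1.0],
                     [-5 / 3, -5 / 3, 1.0],
                     [1.0, 1.0, 1.0]]) / 6

# .expand adds the channel axis (same kernel for R, G and B) and a leading
# unit axis for "number of kernels", without copying the underlying memory.
k = base.expand(1, 3, 3, 3)
print(k.shape)           # torch.Size([1, 3, 3, 3])

# Stand-in for the image: channels x height x width.
t = torch.rand(3, 352, 352)

# Indexing with None inserts a unit axis, giving a mini-batch of one image.
print(t[None].shape)     # torch.Size([1, 3, 352, 352])

# No padding here, so the output loses one pixel on each edge.
edge = F.conv2d(t[None], k)
print(edge.shape)        # torch.Size([1, 1, 350, 350])

# On a constant image a zero-sum kernel finds no edges at all.
flat_response = F.conv2d(torch.ones(1, 3, 8, 8), k)
```

Because the kernel sums to zero, `flat_response` is zero everywhere (up to floating point error); only places where brightness changes towards the bottom right produce large activations.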
That length 37 vector is the final output that we need, because that's what we're going to compare implicitly to our one hot encoded vector, which will have a 1 in the location for Maine Coon. So somehow we need to get from this 11 by 11 by 512 to this 37. And so the way we do it is we actually take the average of every one of these 11 by 11 faces. We just take the mean. So we're going to take the mean of this first face, and that gets us one value. And then we'll take the second of the 512 faces and take that mean, and that'll give us one more value. So we're going to do that for every face, and that will give us a 512 long vector. And so now all we need to do is pop that through a single matrix multiply of 512 by 37, and that's going to give us an output vector of length 37. So this step here where we take the average of each face is called average pooling. So let's go back to our model and take a look. Here it is. Here is our final 512. And here is, well, we'll talk about what concat pooling is in Part 2. For now, we'll just focus on this. Concat pooling is a fastai specialty. Everybody else just does this. Average pool 2D with an output size of one. And then, again, there's a bit of a special fastai thing that we actually have two layers here, but normally people then just have the one linear layer with an input of 512 and an output of 37. So what that means is that for this little box over here where we want a one for Maine Coon, we've got to have a box over here that needs to have a high value in that place so that the loss will be low. So if we're going to have a high value there, the only way to get it with this matrix multiplication is that it's going to represent a simple weighted linear combination of all of the 512 values here.
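As a rough sketch of the "everybody else" head described above (plain average pooling followed by a single linear layer, not fastai's concat pooling with two linear layers), the shapes work out like this. The activations are random stand-ins and the layer is untrained; only the sizes 512, 11 and 37 come from the lesson.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for the output of the convolutional part: a mini-batch of one,
# 512 channels ("faces"), each an 11x11 grid.
acts = torch.randn(1, 512, 11, 11)

# Average pooling with an output size of one: the mean of each 11x11 face.
pool = nn.AdaptiveAvgPool2d(1)
pooled = pool(acts)                 # shape (1, 512, 1, 1)
features = pooled.flatten(1)        # shape (1, 512)

# One matrix multiply from 512 features to 37 classes (plus a bias).
linear = nn.Linear(512, 37)
scores = linear(features)           # shape (1, 37)

print(features.shape, scores.shape)
```

Each of the 37 outputs is a different weighted sum of the same 512 averaged features, which is the point the discussion above builds on.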
So if we're going to be able to say, I'm pretty confident this is a Maine Coon, just by taking the weighted sum of a bunch of inputs, those inputs are going to have to represent features like how fluffy is it? What color is its nose? How long are its legs? How pointy are its ears? All the kinds of things that can be used, because for the other thing, which figures out is this a bulldog, it's got to use exactly the same 512 inputs with a different set of weights, because that's all a matrix multiplication is. It's just a bunch of weighted sums, a different weighted sum for each output. So therefore we know that these potentially dozens or even hundreds of layers of convolutions must have eventually come up with an 11 by 11 face for each of these features saying, in this little bit here, how much is that part of the image like a pointy ear? How much is it fluffy? How much is it like a long leg? How much is it like a very red nose? So that's what all of those things must represent. So each face represents a different feature. So the outputs of these we can think of as different features. So what we really want to know then is not so much what's the average across the 11 by 11 to get this set of outputs, but what we really want to know is what's in each of these 11 by 11 spots. So what if instead of averaging across the 11 by 11, we instead average across the 512? If we average across the 512, that's going to give us a single 11 by 11 matrix. And each grid point in that 11 by 11 matrix will be the average of how activated was that area. When it came to figuring out that this was a Maine Coon, how many signs of Maine Coonishness were there in that part of the 11 by 11 grid? And so that's actually what we do to create our heat map. So I think maybe the easiest way is to kind of work backwards. Here's our heat map. And it comes from something called average activations. And it's just a little bit of matplotlib and fastai.
Fastai to show the image and then matplotlib to take the heat map, which we passed in, which was called average activations, hm for heat map. Alpha 0.6 means make it a bit transparent. And the matplotlib extent means expand it from 11 by 11 to 352 by 352. Use bilinear interpolation so it's not all blocky, and use a different color map to kind of highlight things. So that's just matplotlib, it's not important. The key thing here is that average activations is the 11 by 11 matrix we wanted. Here it is. Average activations .shape is 11 by 11. So to get there, we took the mean of activations across dimension zero, which is what I just said. In PyTorch, the channel dimension is the first dimension. So the mean across dimension zero took us from something of size 512 by 11 by 11, as promised, to something of 11 by 11. So therefore acts contains the activations we're averaging. Where did they come from? They came from something called a hook. So a hook is a really cool, more advanced PyTorch feature that lets you, as the name suggests, hook into the PyTorch machinery itself and run any arbitrary Python code you want to. It's a really amazing and nifty thing. Because, you know, normally when we do a forward pass through a PyTorch module, it gives us this set of outputs. But we know that in the process, it's calculated these. So what I would like to do is I would like to hook into that forward pass and tell PyTorch, hey, when you calculate this, can you store it for me, please? So what is this? This is the output of the convolutional part of the model. So the convolutional part of the model, which is everything before the average pool, is basically all of that. And so thinking back to transfer learning, remember with transfer learning, we actually cut off everything after the convolutional part of the model and replaced it with our own little bit. So with fastai, the original convolutional part of the model is always going to be the first thing in the model.
And specifically, in this case, I'm taking my model, and I'm just going to call it m. So you can see m is this big thing, but always, at least in fastai, m[0] will be the convolutional part of the model. So in this case, we created a, let's go back and see, we created a ResNet-34. So the main part of the ResNet-34, the pre-trained bit we hold on to, is in m[0], and so this is basically it. This is the printout of the ResNet-34, and at the end of it, there are the 512 activations. So in other words, what we want to do is we want to grab m[0], and we want to hook its output. So this is a really useful thing to be able to do. So fastai has actually created something to do it for you, which is literally you say hook_output, and you pass in the PyTorch module that you want to hook the output of. And so most likely the thing you want to hook is the convolutional part of the model, and that's always going to be m[0], or learn.model[0]. So we give that hook a name. Don't worry about this part, we'll learn about it next week. So having hooked the output, we now need to actually do the forward pass. And so remember, in PyTorch, to actually get it to calculate something, which is called doing the forward pass, you just act as if the model is a function. So we just pass in our x mini-batch. So we already had a Maine Coon image called x, but we can't quite pass that into our model. It has to be normalized and turned into a mini-batch and put onto the GPU. So fastai has a thing called a data bunch, which we have in data, and you can always say data.one_item to create a mini-batch with one thing in it. And as an exercise at home, you could try to create a mini-batch without using data.one_item to make sure that you kind of learn how to normalize and stuff yourself if you want to. But this is how you can create a mini-batch with just one thing in it, and then I can pop that onto the GPU by saying .cuda().
That's what I passed to my model. And so the predictions I get out, I actually don't care about, right? Because the predictions are this thing, which is not what I want, right? So I'm not actually going to do anything with the predictions. The thing I care about is the hook that I just created. Now, one thing to be aware of is that when you hook something in PyTorch, that means every single time you run that model, assuming you're hooking outputs, it's storing those outputs. And so you want to remove the hook when you've got what you want, because otherwise if you use the model again, it's going to keep hooking more and more outputs, which will be slow and memory-intensive. So we've created this thing. Python calls it a context manager. You can use any hook as a context manager. At the end of that with block, it'll remove the hook. Okay? So we've got our hook. And so now fastai hooks, or at least the output hooks, always give you something called .stored, which is where it stores away the thing you asked it to hook. And so that's where the activations now are. Okay? So we did a forward pass after hooking the output of the convolutional section of the model. We grabbed what it stored. We checked the shape. It was 512 by 11 by 11, as we predicted. We then took the mean of the channel axis to get an 11 by 11 tensor. And then if we look at that, that's our picture. So there's a lot to unpack. Right? A lot to unpack. But take your time going through these two sections, the convolution kernel section and the heat map section of this notebook, running those lines of code and changing them around a little bit. And remember, the most important thing to look at is shape. You might have noticed when I'm showing you these notebooks, I very often print out the shape. And when you look at the shape, you want to be looking at how many axes there are. That's the rank of the tensor.
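The hook-then-average recipe recapped above can be sketched with plain PyTorch, without fastai. Everything named here is an assumption for illustration: `store_output` is a hypothetical helper built on `register_forward_hook` (it is not fastai's hook_output), and the tiny two-layer "body" stands in for the pre-trained ResNet so the example runs without downloading weights.

```python
from contextlib import contextmanager

import torch
import torch.nn as nn


@contextmanager
def store_output(module):
    """Hypothetical helper: record a module's output, remove the hook on exit."""
    store = {}

    def fn(mod, inputs, output):
        store["stored"] = output.detach()

    handle = module.register_forward_hook(fn)
    try:
        yield store
    finally:
        handle.remove()  # otherwise every later forward pass keeps storing


torch.manual_seed(0)

# Toy stand-in model: m[0] is the convolutional part, m[1] is the head.
body = nn.Sequential(
    nn.Conv2d(3, 8, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(8, 16, 3, stride=2, padding=1), nn.ReLU(),
)
head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 37))
m = nn.Sequential(body, head).eval()

xb = torch.randn(1, 3, 44, 44)        # a mini-batch with one image in it

with store_output(m[0]) as hook:
    with torch.no_grad():
        preds = m(xb)                  # the predictions themselves are not used

acts = hook["stored"][0]               # channels x 11 x 11
avg_acts = acts.mean(0)                # average over the channel axis
print(acts.shape, avg_acts.shape)
```

The resulting 11 by 11 `avg_acts` is what would be stretched over the image as the heat map; with the real model the channel axis would have 512 entries rather than the 16 assumed here.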
Also look at how many things there are in each axis. And try and think, why? Try going back to the printout of the summary. Try going back to the actual list of the layers. And try and go back and think about the actual picture we drew. And think about what's actually going on. Okay. So that's a lot of technical content. So what I'm going to do now is switch from technical content to something much more important, unless we have some questions first. Okay. Because in the next lesson, we're going to be looking at generative models, both text and image generative models. And generative models are where you can create a new piece of text or a new image or a new video or a new sound. And as you probably are aware, this is the area that deep learning has developed the most in over the last 12 months. So we're now at a point where we can generate realistic-looking videos, images, audio, and to some extent even text. And so there are many things in this journey which have ethical considerations, but perhaps this area of generative modeling is one of the largest ones. So before I got into it, I wanted to specifically touch on ethics and data science. Most of the stuff I'm showing you actually comes from Rachel. And Rachel has a really cool TEDx San Francisco talk that you can check out on YouTube. And a more extensive analysis of ethical principles and bias principles in AI, which you can find at this talk here. And she has a playlist that you can check out. We've already touched on an example of bias, which was this Gender Shades study, where if you remember, for example, for lighter-skinned males, IBM's main computer vision system was 99.7% accurate. And for darker-skinned females it was some hundreds of times less accurate in terms of error. So like extraordinary differences.
And so, okay, it's first of all important to be aware that not only can this happen technically, but that this can happen on a massive company's rolled out, publicly available, highly marketed system that hundreds of quality control people have studied and lots of people are using. It's out there in the world. It all looks kind of crazy. And so it's interesting to think about why. And so one of the reasons why is the data we feed these things. We tend to use, me included, a lot of these data sets kind of unthinkingly. But ImageNet, which is the basis of a lot of the computer vision stuff we do, is over half America and Great Britain. Like, when it comes to the countries that actually have most of the population in the world, I can't even see them here. They're somewhere in these impossibly thin lines. Because remember these data sets are being created almost exclusively by people in the U.S., Great Britain, and nowadays increasingly also China. So there's a lot of bias in the content we're creating because of a bias in the kind of people that are creating that content, even when in theory it's being created in a very kind of neutral way. But you can't argue with the data, right? It's obviously not neutral at all. And so when you have biased data creating biased algorithms, you then need to say, what are we doing with that? So we've spent a lot of time talking about image recognition. So a couple of years ago, this company, DeepGlint, advertised their image recognition system, which can be used to do mass surveillance on large crowds of people. Find any person passing through who is a person of interest, in theory. And so putting aside even the question of, is it a good idea to have such a system, you've got to think, is it a good idea to have such a system where certain kinds of people are 300 times more likely to be misidentified?
And then thinking about it, so this is now starting to happen in America, where these systems are being rolled out. And so there are now systems in America that will identify a person of interest in a video and send a ping to the local police. And so these systems are extremely inaccurate and extremely biased. And what happens then, of course, is if you're in a predominantly black neighborhood, the probability of successfully recognizing you is much lower and you're much more likely to be surrounded by black people. And so suddenly all of these black people are popping up as persons of interest, or in a video of a person of interest, or the people in the video are all recognized as in the vicinity of the person of interest, and you suddenly get all these pings going off at the local police department, causing the police to run down there, and therefore likely to lead to a larger number of arrests, which is then likely to feed back into the data being used to develop the systems. So this is happening right now. And so, thankfully, a very small number of people are actually bothering to look into these things. I mean, ridiculously small, but at least it's better than nothing. And so, for example, one of the best ways that people get publicity is to do kind of funny experiments like, let's try the mugshot image recognition system that's being widely used and try it against the members of Congress, and find out that there are 28 members of Congress who would have been identified by this system, obviously incorrectly. Oh, I didn't know that. Okay, members of... Black members of Congress. Not at all surprised to hear that. Thank you, Rachel. We see this kind of bias in a lot of the systems we use. Not just image recognition, but text translation. When you convert "she is a doctor, he is a nurse" into Turkish, you quite correctly get a gender-unspecific pronoun because that's what Turkish uses.
You could then take that and feed it back into English with your gender-unspecific pronoun and you will now get "he is a doctor, she is a nurse". So the bias, again, this is in a massively widely rolled out, carefully studied system. And it's not even like these kinds of things are little one-off things that then get fixed quickly. These issues have been identified in Google Translate for a very long time and they're still there and they don't get fixed. So the kind of results of this are, in my opinion, quite terrifying because what's happening is that in many countries, including America where I'm speaking from now, algorithms are increasingly being used for all kinds of public policy, judicial and so forth purposes. For example, there's a system called COMPAS which is very widely used to decide who's going to jail. And it does that in a couple of ways. It tells judges what sentencing guidelines they should use for particular cases and it tells them also which people, the system says, should be let out on bail. But here's the thing. For white people, it keeps on saying, let this person out even though they end up reoffending, and vice versa. It's systematically out by about double compared to what it should be in terms of getting it wrong with white people versus black people. So this is kind of horrifying because, I mean, amongst other things, the data that it's using in this system is literally asking people questions about things like, did any of your parents ever go to jail? Do any of your friends do drugs? Like they're asking questions about other people who they have no control over. So not only are these systems biased, very systematically biased, but they're also being done on the basis of data which is totally out of your control. So this is kind of, did you want to add something to that? Oh yeah, are your parents divorced is another question that's being used to decide whether you go to jail or not.
So when we raise these issues kind of on Twitter or in talks or whatever, there's always a few people, always white men, a few people who will always say, that's just the way the world is. That's just reflecting what the data shows. But when you actually look at it, it's not. It's actually systematically erroneous, and systematically erroneous against people of color, minorities, the people who are less involved in creating the systems that these products are based on. Sometimes this can go a really long way. So for example, in Myanmar, there was a genocide of the Rohingya people, and that genocide was very heavily created by Facebook. Not because anybody at Facebook wanted it, I mean, heavens no. I know a lot of people at Facebook, I have a lot of friends at Facebook, they're really trying to do the right thing. They're really trying to create a product that people like. But not in a thoughtful enough way, because when you roll out something in Myanmar, a country where maybe half of people didn't have electricity until very recently, and you say, hey, you can all have free internet as long as it's just Facebook, you've got to think carefully about what you're doing. And then you use algorithms to feed people the stuff they will click on. And of course what people click on is stuff which is controversial, stuff that makes their blood boil. So when they actually started asking the generals in the Myanmar Army that were literally throwing babies onto bonfires, they were saying, we know that these are not humans, we know that they are animals because we read the news, we read the internet. And because these are the stories that the algorithms are pushing. And the algorithms are pushing the stories because the algorithms are good. They know how to create eyeballs, how to get people watching and how to get people clicking. And again, nobody at Facebook said, let's cause a massive genocide in Myanmar.
They said, let's maximize the engagement of people in this new market on our platform, right? So they very successfully maximized engagement. Yes, please. It's important to note that people warned executives at Facebook about how the platform was being used to incite violence as far back as 2013, 2014, and 2015. In 2015, someone even warned executives that Facebook could be used in Myanmar in the same way that radio broadcasts were used in Rwanda during the Rwandan genocide. And as of 2015, Facebook only had four contractors who spoke Burmese working for them. They really did not put many resources into the issue at all, even though they were getting very alarming warnings about it. So why does this happen? Part of the issue is that ethics is complicated, and you will not find Rachel or me telling you how to do ethics. How do you fix this? We don't know. We can just give you things to think about. Another part of the problem, which we keep hearing, is: it's not my problem, I'm just a researcher, I'm just a techie, I'm just building a data set. I'm not part of the problem, I'm part of this foundation that's far enough away that I can imagine that I'm not part of this, right? But, you know, if you're creating ImageNet and you want it to be successful, you want lots of people to use it, you want lots of people to build products on it, lots of people to do research on top of it. If you're trying to create something that people are using, then please try to make it something that won't cause massive amounts of harm and doesn't have massive amounts of bias. And it can actually come back and bite you in the arse, right? The Volkswagen engineer who ended up actually coding the thing that made them systematically cheat on their diesel emissions tests, on their pollution tests, ended up in jail.
Not because it was their decision to cheat on the tests, but because their manager told them to write the code and they wrote the code, and therefore they were the ones that ended up being criminally responsible and they were the ones that were jailed, right? So if you do, in some way, a shitty thing that ends up causing trouble, that can absolutely come back around and get you in trouble as well. Sometimes it can cause huge amounts of trouble. So if we go back to World War II, this was one of the first great opportunities for IBM to show off their amazing tabulating system, and they had a huge client in Nazi Germany. And Nazi Germany used this amazing new tabulating system to encode all of the different types of Jews that they had in the country and all the different types of problem people. So Jews were eight, Gypsies were 12, and then different outcomes were coded: executions were a four, death in a gas chamber was six. A Swiss judge ruled that IBM was actively involved in facilitating the commission of these crimes against humanity. So there are absolutely plenty of examples of people building data processing technology that is directly causing deaths, sometimes millions of deaths, right? So we don't want to be one of those people. And you might have thought, oh, you know, I'm just creating some data processing software, and somebody else is thinking, I'm just a salesperson, and somebody else is thinking, I'm just the biz dev person opening new markets, but it all comes together, right? So we need to care. And one of the things we need to care about is getting humans back in the loop, right? When we pull humans out of the loop, that's one of the first times that trouble happens. I don't know if you remember, but I remember this very clearly: when I first heard that Facebook was firing the human editors that were responsible for basically curating the news that ended up on the Facebook pages.
And I've got to say, at the time I thought, that's a recipe for disaster, because I've seen again and again that humans can be the person in the loop that can realize this isn't right. You know, it's very hard to create an algorithm that can recognize this isn't right, whereas humans are very good at that. And we saw that's what happened, right? After Facebook fired the human editors, the nature of stories on Facebook dramatically changed. You started seeing this proliferation of conspiracy theories, and the algorithms went crazy with recommending more and more controversial topics. And of course that changed people's consumption behavior, causing them to want more and more controversial topics. So one of the really interesting places this comes in, as Cathy O'Neil, who's got a great book called Weapons of Math Destruction (thank you, Rachel), and many others have pointed out, is that what happens to algorithms is that they end up impacting people. For example, COMPAS sentencing guidelines go to a judge. Now, you can say the algorithm's very good. I mean, in COMPAS's case it isn't. It actually turned out to be about as bad as random, because it's a black box and all that. But even if it was very good, you could then say, well, you know, the judge is getting the algorithm. Otherwise, they'd just be getting a person. People also give bad advice. So what? Well, humans respond differently to algorithms. It's very common, particularly for a human that is not very familiar with the technology themselves, like a judge, to say, oh, that's what the computer says. The computer looked it up and it figured this out. It's extremely difficult to get a non-technical audience to look at a computer recommendation and come up with a nuanced decision-making process. So what we see is that algorithms are often put into place with no appeals process. They're often used to massively scale up decision-making systems because they're cheap.
And then the people that are using the outputs of those algorithms tend to give them more credence than they deserve, because very often they're being used by people that don't have the technical competence to judge them themselves. So a great example: here's somebody who lost their health care. And they lost their health care because of an error in a new algorithm that was systematically failing to recognize that there are many people that need help with, was it Alzheimer's? Cerebral palsy and diabetes. Thanks, Rachel. And so this system, which had this error that was later discovered, was cutting off these people from the home care that they needed, so that cerebral palsy victims no longer had the care they needed. So their life was destroyed, basically. And when the person that created that algorithm with the error was asked about this, specifically whether they should have found a better way to communicate the system's strengths, failures, and so forth, he said, yeah, I should probably also dust under my bed. That was their answer. That was the level of interest they had. And this is extremely common. I hear this all the time. And it's much easier to see it from afar and say, okay, after the problem's happened, I can see that that's a really shitty thing to say, but it can be very difficult when you're in the middle of it. I just want to say one more thing about that example. And that's that this was a case where the roles were separate. There was someone who created the algorithm. Then I think different people implemented the software. And this is in use in over half of the 50 states. And then there were also the particular policy decisions made by that state. And so this is one of those situations where nobody felt responsible, because the algorithm creator is like, oh, no, it's the policy decisions of the state that were bad. And the state can be like, oh, no, it's the ones who implemented the software.
And so everyone's just kind of pointing fingers and not taking responsibility. And in some ways maybe it's unfair, but I would argue the person who is creating the data set and the person who is implementing the algorithm are the people best placed to get out there and say, hey, here are the things you need to be careful of, and to make sure that they're a part of the implementation process. So we've also seen this with YouTube, right? It's kind of similar to what happened with Facebook. We've heard examples of students watching the fast.ai courses who say, hey, Jeremy and Rachel, I was watching the fast.ai courses, really enjoyed them. And at the end of one of them, the YouTube autoplay fed me across to a conspiracy theory. And what happens is that once the system decides that you like the conspiracy theories, it's going to just feed you more and more. And then what happens is that... Please, go on. Just briefly, you don't even have to like conspiracy theories. Getting as many people hooked on conspiracy theories as possible is what the algorithm is trying to do, whether or not you've expressed interest. Right. And the interesting thing, again, is that I know plenty of people involved in YouTube recommendation systems. None of them want to promote conspiracy theories. But people click on them, right? And people share them. And what also tends to happen is that people who are into conspiracy theories consume a lot more YouTube media. So the algorithm is actually very good at finding a market that watches a lot of hours of YouTube. And then it makes that market watch even more. So this is an example of a feedback loop. And The New York Times is now describing YouTube as perhaps the most powerful radicalizing instrument of the 21st century. Now I can tell you, my friends that worked on the YouTube recommendation system did not think they were creating the most powerful radicalizing instrument of the 21st century.
And to be honest, most of them today, when I talk to them, still think they're not. They think it's all bullshit. Not all of them. But a lot of them now are at the point where they just feel like they're the victims here. People are being unfair. They don't get it. They don't understand what we're trying to do. It's very, very difficult when you're right out there in the heart of it. So you've got to be thinking, right from the start, what are the possible unintended consequences of what you're working on? And as the technical people involved, how can you get out in front and make sure that people are aware of them? And I just also wanted to say that, in particular, many of these conspiracy theories are promoting white supremacy. They're far-right ethno-nationalist, anti-science. And I think maybe five or 10 years ago, I would have thought conspiracy theories were more of a fringe thing, but we're seeing the huge societal impact it can have for many people to believe these. Yeah. And partly it's that you see them on YouTube all the time, so it starts to feel a lot more normal, right? So one of the things that people are doing to try to fix this problem is to explicitly get involved in talking to the people who might or will be impacted by the kind of decision-making processes that you're enabling. So for example, there was a really cool thing recently where statisticians and data scientists literally got together with people who had been inside the criminal system, who had gone through the bail and sentencing process themselves, and with the lawyers who worked with them. Together they actually put together a timeline of how exactly it works, where exactly the places are that there are inputs, how people respond to them, and who's involved. This is really cool, right? This is the only way for you, as a data product developer, to actually know how your data product's going to be working.
A really great example of somebody who did a great job here was Evan Estola at Meetup, who said, hey, a lot of men are going to our tech meetups, and if we use a recommendation system naively, it's going to recommend more tech meetups to men, which is going to cause more men to go to them, and then when women do try to go, they'll be like, oh my God, there are so many men here, which is going to cause more men to go to the tech meetups. Yeah, so showing recommendations to men and therefore not showing them to women. Yes, yeah. So what Evan and Meetup decided was to make an explicit product decision: this would not even be representing the actual true preferences of people. It would be creating a runaway feedback loop. So let's explicitly stop it before it happens, and not recommend fewer tech meetups to women and more tech meetups to men. So I think it's really cool. It's like saying we don't have to be slaves to the algorithm. We actually get to decide. Another thing that people can do to help is regulation. And normally when we talk about regulation, there's a natural reaction of, how do you regulate these things? That's ridiculous. You can't regulate AI. But actually, when you look at it, you see it again and again, and this fantastic paper called Datasheets for Datasets has lots of examples of this. There are many, many examples of industries where people thought they couldn't be regulated. People thought that's just how it was. Like cars: people died in cars all the time because they literally had sharp metal knobs on dashboards. Steering columns weren't collapsible. And all of the discussion in the community was, that's just how cars are, and when people die in cars, it's because of the people. But then eventually the regulations did come in. And today driving is dramatically safer. Like dozens and dozens of times safer than it was before. So often there are things we can do through policy.
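The runaway feedback loop described in the Meetup example can be sketched as a toy simulation. This is only an illustration under made-up assumptions: the function name, the update rule, and every number here are hypothetical, and none of it reflects Meetup's actual system or data.

```python
def simulate_recommendation_share(share_men=0.6, rounds=10, strength=0.5,
                                  intervene=False):
    """Toy model of the share of tech-meetup recommendations shown to men.

    Hypothetical assumptions (not from any real system):
    - share_men: starting fraction of recommendations going to men.
    - strength: how strongly a naive recommender amplifies the current imbalance.
    - intervene: if True, an explicit product decision holds the share fixed.
    """
    history = [share_men]
    for _ in range(rounds):
        if not intervene:
            # Naive recommender: the further the share already is from 50/50,
            # the harder it pushes in the same direction. The (1 - share_men)
            # factor keeps the share from exceeding 1.
            share_men = share_men + strength * (share_men - 0.5) * (1 - share_men)
        # With intervention, the share is deliberately left unchanged.
        history.append(share_men)
    return history


naive = simulate_recommendation_share()
fixed = simulate_recommendation_share(intervene=True)
print("naive recommender:", [round(s, 3) for s in naive])
print("with intervention:", [round(s, 3) for s in fixed])
```

In the naive run the imbalance grows every round even though nobody's underlying preferences were modeled as changing, which is the point of the example: the loop itself creates the drift. The intervention here is the crudest possible one (freeze the share); a real product decision would be more nuanced.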
So to summarize, we are part of the 0.3 to 0.5% of the world that knows how to code. We have a skill that very few other people do. Not only that, we now know how to code deep learning algorithms, which is like the most powerful kind of code I know. So I'm hoping that we can explicitly think about at least not making the world worse, and perhaps explicitly making it better. And so why is this interesting to you as an audience in particular? That's because fast.ai in particular is trying to make it easy for domain experts to use deep learning. And so this picture of the goats here is an example from one of our international fellows from the previous course, who was a goat dairy farmer and told us that they were going to use deep learning on their remote Canadian island to help study udder disease in goats. And to me, this is a great example of a domain expert's problem which nobody else even knows about, let alone knows that it's a computer vision problem that can be solved with deep learning. So in your field, whatever it is, you probably know a lot more now about the opportunities to make it a hell of a lot better than it was before. You'll probably be able to come up with all kinds of cool product ideas, maybe build a startup or create a new product group in your company or whatever. But also, please be thinking about what that's going to mean in practice. Think about where you can put humans in the loop, where you can put those pressure release valves, who the people are that you can talk to who could be impacted and who could help you understand, and get the humanities folks involved, who understand history and psychology and sociology and so forth. So that's our plea to you. If you've got this far, you're definitely at a point now where you're ready to make a serious impact on the world. So I hope we can make sure that that's a positive impact. See you next week.