So let's start by reviewing what we've learned about optimizing multi-layer functions with SGD. The idea is that we've got some data, and then we do something to that data, for example we multiply it by a weight matrix. Then we do something to that, for example we put it through a softmax or a sigmoid. And then we do something to that, such as a cross-entropy loss or a root-mean-squared-error loss, and that gives us some scalar. So this has no hidden layers: it's got a linear layer, a nonlinear activation being a softmax, and a loss function being root-mean-squared error or cross-entropy. And then we've got our input data, so: input, linear, nonlinear, loss. If this was a sigmoid or softmax and this was cross-entropy, that would be logistic regression. For now think of it as root-mean-squared error, or just some loss function; we'll come back to cross-entropy in a moment. So how do we calculate the derivative of that with respect to our weights? Really it would be better if we wrote f(x, w) here, because it's a function of the weights as well, and we want the derivative of the whole thing, g of f of x and w, with respect to the weights; sorry, I put it in the wrong spot before, I just screwed up, that's why that didn't make sense. To do that we just use the chain rule: we say this is equal to h(u), where u = g(v) and v = f(x, w), so we can rewrite it like that, and then the chain rule says the derivative is h'(u) times g'(v) times f'(x). Happy with all that so far? So to take the derivative with respect to the weights, we just calculate that derivative with respect to w using that exact formula: d of all that, dw, is that product. Then if we went further and had another linear layer with another weight matrix, w2, there's no difference: to calculate the derivative with respect to all of the parameters we can still use the exact same chain rule. So don't think of the multi-layer network as things that occur at different times; it's just a composition of functions, and we use the chain rule to calculate all the derivatives at once. There's just a set of parameters that happen to appear in different parts of the function, but the calculus is no different. So to calculate this with respect to w1 and w2, you can just call the whole lot w and say w1 is all of those weights together.
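To make that concrete, here's a tiny sketch (my own toy example, not something from the lecture notebook) of the same composition in PyTorch: f is a linear layer, g is a sigmoid, h is a squared-error loss, and one call to backward() applies exactly that chain rule to give d(loss)/dw.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0])
y = torch.tensor(1.0)                      # target
w = torch.tensor([0.1, -0.2, 0.3], requires_grad=True)

v = x @ w                                  # v = f(x, w): the linear layer
u = torch.sigmoid(v)                       # u = g(v): the nonlinearity
loss = (u - y) ** 2                        # h(u): squared-error loss

loss.backward()
print(w.grad)                              # d(loss)/dw via autograd

# manual chain rule check: h'(u) = 2(u - y), g'(v) = u(1 - u), df/dw = x
with torch.no_grad():
    print(2 * (u - y) * u * (1 - u) * x)   # matches w.grad
```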
So the result, and that's a great question, is that what you're going to have is a list of parameters. Here's w1, and it's probably some kind of higher-rank tensor; if it's a convolutional layer it'll be a rank-3 tensor or whatever, but we can flatten it out into a list of parameters. There's w1, there's w2: it's just another list of parameters. And here's our loss, which is a single number, so our derivative is just a vector of that same length: how much does changing this value of w affect the loss, how much does changing that value of w affect the loss, and so on. You can basically think of it as a function like y = a*x1 + b*x2 + c and ask, what's the derivative of that with respect to a, b and c? You'd have three numbers: the derivative with respect to a, and b, and c. That's all this is; it's the derivative with respect to that weight, and that weight, and that weight. To get there, inside the chain rule, we had to calculate Jacobians; I'm not going to go into detail here, but when you take a matrix product you've got a weight matrix, you've got an input vector (the activations from the previous layer), and you've got some new output activations, and now you've got to say: for this particular weight, how does changing it change this particular output, and how does changing it change that particular output, and so forth. So you end up with these higher-dimensional tensors showing, for every weight, how it affects every output. But by the time you get to the loss function, the loss function has a mean or a sum in it, so they all get added up in the end. This kind of thing drives me a bit crazy to calculate out by hand, or even to think through step by step, because you have to remember that for every input in a layer and every output in the next layer, for every weight, you're going to have a separate gradient. One good way to get a feel for this is to learn to use PyTorch's .grad attribute and .backward() method manually; look up the PyTorch tutorials, set up some calculations with a vector input and a vector output, call .backward(), and then look at .grad. Do some really small ones with just two or three items in the input and output vectors, make the operation something like "plus two", and look at what the shapes are and make sure they make sense. This vector-matrix calculus strictly speaking introduces zero new concepts beyond anything you learned in high school, but getting a feel for how these shapes move around takes a lot of practice, I find. The good news is you almost never have to worry about it.
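Here's what that little experiment might look like (a sketch; the values are made up). For a vector output you either sum it to a scalar first or pass backward() a vector of ones.

```python
import torch

# vector in, vector out, with an operation like "plus two"
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x + 2
y.sum().backward()                     # reduce to a scalar, then backpropagate
print(x.grad)                          # tensor([1., 1., 1.]): dy_i/dx_i = 1

# now a matrix product, to see how gradient shapes follow the parameter shapes
x.grad.zero_()                         # clear the accumulated gradient first
w = torch.randn(3, 2, requires_grad=True)
out = x @ w                            # 3-vector in, 2-vector out
out.sum().backward()
print(x.grad.shape, w.grad.shape)      # torch.Size([3]) torch.Size([3, 2])
```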
Okay, so we were talking about using this kind of logistic regression for NLP, and before we got to that point we were talking about using naive Bayes for NLP. The basic idea was that we could take a document, a review like "this movie is good", and turn it into a bag-of-words representation consisting of the number of times each word appears; and we call the unique list of words the vocabulary. We used sklearn's CountVectorizer to automatically generate both the vocabulary, which sklearn calls the features, and the bag-of-words representations, and the whole group of them together is called a term-document matrix. And we realized that we could calculate the probability that a positive review contains the word "this" by just averaging how often "this" appears in the positive reviews, and we could do the same for the negatives; and then we could take the ratio of them to get something which, if it's greater than one, is a word that appeared more often in the positive reviews, or if it's less than one, a word that appeared more often in the negative reviews. Then we realized, using Bayes' rule and taking logs, that we could end up with something where we add up the logs of these ratios, plus the log of the ratio of the probabilities that things are in class one versus class zero, and end up with something we can compare to zero: if it's greater than zero we predict the document is positive, and if it's less than zero we predict it's negative. That was our Bayes' rule classifier. So we did that from math first principles, and I think we agreed that the "naive" in naive Bayes is a good description, because it assumes independence, and that's definitely not true. But it was an interesting starting point, and it was interesting to observe that once we'd calculated the ratio of the probabilities and taken the log, and then added them up rather than multiplying them together, what we'd actually written down was just a standard weight matrix product plus a bias. And so then we realized: if this 80% accuracy is not very good, why not improve it by saying, hey, we know other ways to calculate a bunch of coefficients and a bunch of biases, which is to learn them in a logistic regression? In other words, this is the same formula we use for a logistic regression, so why don't we just create a logistic regression and fit it? It's going to give us the same kind of thing, but rather than coefficients and biases which are theoretically correct based on this assumption of independence and on Bayes' rule, they'll be the coefficients and biases that are actually the best on this data. So that's where we got to, and the key insight here is that just about everything I find in machine learning ends up being either a tree or a bunch of matrix products and nonlinearities; everything seems to come down to the same thing, including, as it turns out, Bayes' rule. And then it turns out that nearly all of the time, whatever the parameters are in that function, they're better learned than calculated based on theory; and indeed that's what happened when we actually tried learning those coefficients: we got about 85%.
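As a compressed sketch of that whole pipeline (toy reviews, hypothetical variable names, and add-one smoothing thrown in so nothing divides by zero), it looks something like this:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

docs = ["this movie is good", "this movie is great",
        "this movie is bad", "boring bad movie"]
y = np.array([1, 1, 0, 0])                    # 1 = positive review, 0 = negative

veczr = CountVectorizer()
term_doc = veczr.fit_transform(docs)          # the term-document matrix (sparse)
x = term_doc.toarray()

p = (x[y == 1].sum(0) + 1) / ((y == 1).sum() + 1)   # P(word | positive), smoothed
q = (x[y == 0].sum(0) + 1) / ((y == 0).sum() + 1)   # P(word | negative), smoothed
r = np.log(p / q)                                   # log ratio for each word
b = np.log((y == 1).mean() / (y == 0).mean())       # log ratio of the class priors

print(x @ r + b > 0)                          # naive Bayes: compare to zero

m = LogisticRegression(max_iter=1000).fit(term_doc, y)   # or just learn them
print(m.coef_)                                # learned coefficients instead of r
```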
So then we noticed that, rather than take the whole term-document matrix, we could instead just take the ones and zeros for presence or absence of a word, and sometimes that was equally as good. But then we tried something else, which was adding regularization, and with regularization the binarized approach turned out to be a little better. So regularization was where we took the loss function, and again let's start with RMSE and then we'll talk about cross-entropy. The loss function was our predictions minus our actuals, squared, summed up and averaged, plus a penalty; and this one specifically is the L2 penalty, the sum of the squared weights. If it was instead the absolute value of w, that would be the L1 penalty. We also noted that we don't really care about the loss function per se, we only care about its derivative, because that's actually the thing that updates the weights; and because this is a sum, we can take the derivative of each part separately, and the derivative of the penalty part is just two times the constant times w. And so we learned that even though these are mathematically equivalent, they have different names: this derivative version is called weight decay, and that's the term used in the neural net literature. Cross-entropy, on the other hand, is just another loss function like root-mean-squared error, but it's specifically designed for classification. So here's an example of binary cross-entropy. Say our target is "is it a cat or a dog", so "is cat" is one or zero: cat, cat, dog, dog, cat; and these are our predictions, the output of the final layer of our neural net or logistic regression or whatever. Then all we do is take the actual times the log of the prediction, add to that one minus the actual times the log of one minus the prediction, and take the negative of that whole thing. I suggested you all try to write the if-statement version of this, so hopefully you've done that by now; otherwise I'm about to spoil it for you. So this was y times log y, plus one minus y times log of one minus y, and the negative of that. So who wants to tell me how to write this as an if statement? Chenxi, hit me. Chenxi: I'll give it a try; if y equals 1, then return log y, else return log of one minus y. Good, so that's the thing in the brackets, and then it takes a minus. The key insight Chenxi is using is that y has only two possibilities, one or zero; and very often the math can hide the key insight, which I think happens here, until you actually think about what values it can take. That's all it's doing: it's saying either give me that or give me that. Could you pass that to the back please, Chenxi? Student: sorry if I'm missing something, but do you not need two variables in that statement? Because you've got y, so it'd be like y-hat and y or something. Oh, yeah, thank you; as usual it's me missing something: it should be y times log of y-hat, plus one minus y times log of one minus y-hat. And then the multi-category version is just the same thing, but you're saying it for more than just y equals 1 or 0; for example y equals 0, 1, 2, 3, 4, 5, 6, 7, 8, 9.
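If you want to check your if-statement version, here's one way it might look (a sketch in plain Python, with y the actual label and y_hat the prediction):

```python
import math

def binary_cross_entropy(y, y_hat):
    # -(y*log(y_hat) + (1-y)*log(1-y_hat)), exploiting the fact that y is 0 or 1
    if y == 1:
        return -math.log(y_hat)
    else:
        return -math.log(1 - y_hat)

print(binary_cross_entropy(1, 0.9))   # small loss: confident and correct
print(binary_cross_entropy(0, 0.9))   # large loss: confident and wrong
```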
And that loss function has a particularly simple derivative, which you can figure out yourself. Another thing you could play with at home is thinking about how the derivative looks when you add a sigmoid or a softmax before it; it turns out very nicely, because you've got an e-to-the-something going into a log, so you end up with very well-behaved derivatives. There are lots of reasons people use RMSE for regression and cross-entropy for classification, but most of it comes back to the statistical idea of a best linear unbiased estimator, and the likelihood function: it turns out these have some nice statistical properties. It turns out, however, that in practice, for root-mean-squared error in particular, those properties are perhaps more theoretical than actual, and nowadays using the absolute deviation rather than the sum of squared deviations can often work better. So in practice, like everything in machine learning, I normally try both loss functions for a particular data set and see which one works better; unless, of course, it's a Kaggle competition, in which case you're told how Kaggle is going to judge it, and you should use the same loss function as Kaggle's evaluation metric. All right, so this is really the key insight: let's not use theory, but instead learn things from the data, and we hope we'll get better results; particularly with regularization, we do. And then I think the key regularization insight here is: let's not try to reduce the number of parameters in our model, but instead use lots of parameters and then use regularization to figure out which ones are actually useful. So then we took that a step further by saying: given that we can do that with regularization, let's create lots more features by adding bigrams and trigrams; bigrams like "by vast" and "by vengeance", and trigrams like "by vengeance ." and "by Vera Miles". Just to keep things a little faster we limited it to 800,000 features, but even with the full 70 million features it works just as well, and it's not a hell of a lot slower. So we created a term-document matrix again, using the full set of n-grams, for the training set and the validation set, and now we can go ahead and say: our labels are the training labels as before, our independent variables are the binarized term-document matrix as before, and then let's fit a logistic regression to that and do some predictions, and we get 90% accuracy. So this is looking pretty good. Now, for the logistic regression, let's go back to our naive Bayes. In naive Bayes we have this term-document matrix, and then for every feature we're calculating the probability of that feature occurring if it's class one, the probability of that feature occurring if it's class zero, and the ratio of those two. In the paper that we're actually basing this off, they call these p, q and r; maybe I should fill that in, p, q, and maybe say "probability" to make it more obvious. And so then we said: hey, let's not use these ratios as the coefficients in that matrix multiply, but let's instead try to learn some coefficients. So maybe start out with some random numbers and then try to
use stochastic gradient descent to find slightly better ones. So you'll notice some important features here: the r vector is a vector of rank one, and its length is equal to the number of features; and of course our logistic regression coefficient matrix is also rank one, with length equal to the number of features. So we're saying there are two ways of calculating the same kind of thing: one based on theory, one based on data. Here are some of the numbers in r. Remember it's using the log, so the numbers which are less than zero represent things which are more likely to be negative, and this one here is more likely to be positive; and here's e to the power of that, and those are the ones we compare to one rather than to zero. So now I'm going to do something that will hopefully seem weird. First I'll say what we're going to do, then I'll try to describe why it's weird, and then we'll talk about why it may not be as weird as we first thought. Here's what we're going to do: we're going to take our term-document matrix and multiply it by r. What that means, and we can do it here in Excel, is we're going to grab everything in our term-document matrix and multiply it by the equivalent value in the vector r; so this is a broadcast element-wise multiplication, not a matrix multiplication. And that's what that does: here is the value of the term-document matrix times r. In other words, everywhere a zero appears there, a zero appears here; and every time a one appears here, the equivalent value of r appears here. So we haven't really changed much; we've just changed the ones into something else, into the r's for that feature. And what we're now going to do is use this as our independent variables instead, in our logistic regression. So here we are: x_nb, the x naive Bayes version, is x times r; and now let's do a logistic regression fitting with those independent variables, then do the same for the validation set and get the predictions; and lo and behold, we have a better number. So let me explain why this hopefully seems surprising, given that we're just multiplying. Oh, I picked out the wrong ones; I should have said r, not coef. Okay, that's actually r; I grabbed the wrong number. So that's our independent variables, and the logistic regression has come up with some set of coefficients; let's pretend for a moment that these are the coefficients it happened to come up with. We could now say: let's not use this set of independent variables, let's use the original binarized feature matrix, and then divide all of our coefficients by the values in r, and we're going to get exactly the same result mathematically. So we've got our x naive Bayes version of the independent variables, and we've got some set of coefficients, call it w1, where it's found that this is a good set of coefficients for making our predictions. But x_nb is simply equal to x times, as in element-wise times, r; so in other words this is equal to x times r times the weights, and so we could just change the weights to be r times the weights instead, and get
the same number. So this ought to mean that the change we made to the independent variables shouldn't have made any difference, because we can calculate exactly the same thing without making that change. So that's the question: why did it make a difference? To answer it, I'm going to try to get you all to think about this. You need to think about what the things are that aren't mathematically the same; why is it not identical; come up with some hypotheses for why we might actually have ended up with a better answer. And to figure that out, we first need to work out why it's even a different answer, why that is different to that; it's subtle. What do you think? Student: I was just wondering if there were two different kinds of multiplications; you said that one is the element-wise multiplication. No, they do end up mathematically being the same, pretty much; there's a minor wrinkle, but it's not that, it's not some order-of-operations thing. Let's try you; you're on a roll today, so let's see how you go. Student: I feel like the features are less correlated with each other. Well, I've made the claim that these are mathematically equivalent, so what are you saying really; why are we getting different answers? It's good, keep coming up with hypotheses, we need lots of wrong answers before we start finding the right ones; it's like that warmer, hotter, colder game. Ernest, are you going to get us hotter? Ernest: does it have anything to do with the regularization? Yes. So let's start there. Ernest's point here is: okay, Jeremy, you've said they're equivalent, but they're equivalent outcomes; you went through a process to get there, and that process included regularization, and they're not necessarily equivalent under regularization, because our loss function has a penalty. So help us think through, Ernest, how much that might impact things. Ernest: well, this is maybe kind of dumb, but I'm just noticing that the numbers are bigger in the ones that have been weighted by the naive Bayes r values; some are smaller, some are bigger, but there are some bigger ones, and the variance between the columns is much higher now. The variance is bigger, yeah, I think that's a very interesting insight. Okay, that's all I got. Okay, so build on that. Prince, you've been on a roll, so hit us. Prince: actually, I'm not sure, but is it also considering the dependency between different words; is that why it's performing better, rather than treating all the words as independent of each other? Not really. Again, theoretically these are creating mathematically equivalent outputs, so they're not doing something different, except, as Ernest mentioned, they're getting impacted differently by regularization. So what's regularization? Regularization is, we start out with our... that was the weirdest thing: I forgot to go into screen-drawing mode, and it turns out you can actually draw on Excel directly, I had no idea that was true; I usually use screen-drawing mode so I don't mark up my spreadsheet, so I'd just never tried. So: our loss was equal to our cross-entropy loss, based on the predictions and the actuals, plus our penalty.
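Written out, with lambda standing in for the regularization strength (just restating what's on the board, nothing new):

loss = cross_entropy(predictions, actuals) + lambda * sum(w^2)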
So if your weights are large, then that penalty piece gets bigger, and it drowns out the cross-entropy piece, which is actually the piece we care about; we actually want it to be a good fit. So we want to have as little regularization going on as we can get away with; we want to have lesser weights. Student: when you say less weights, do you mean lesser weights? I do, yeah, and I use the two words a little interchangeably, which is not quite fair, I agree; but the idea is that weights that are pretty close to zero are kind of not there. So here's the thing: our values of r, and I'm not a Bayesian weenie, but I'm still going to use the word prior, are kind of like a prior. We think that the different levels of importance, and positive or negative, of these different features might be something like that; we think that "bad" might be more correlated with negative than "good" is. So our implicit assumption before was that we had no prior; in other words, when we said squared weights, we were saying that a non-zero weight is something we don't want to have. But actually what I really want to say is that differing from the naive Bayes expectation is something I don't want to do: only vary from the naive Bayes prior if you have good reason to believe otherwise. And that's actually what this ends up doing. We end up saying: you know what, we think this value is probably about right, and if you're going to make it a lot bigger or a lot smaller, that's going to create the kind of variation in weights that's going to cause that squared term to go up; so if you can, just leave all these values about where they are now. And that's what the penalty term is now doing: the penalty term, when our inputs have already been multiplied by r, is saying, penalize things where we're varying from our naive Bayes prior. Can you pass that? Student: why multiply only by r, and not by a constant, or r squared or something like that, when the variance would be much higher? Because our prior comes from an actual theoretical model. I said I don't like to rely on theory, but if I have some theory, then maybe we should use that as our starting point rather than starting off by assuming everything's equal. Our prior said: we've got this model called naive Bayes, and the naive Bayes model says that if the naive Bayes assumptions were correct, then r is the correct coefficient, in this specific formulation. That's why we pick it; our prior is based on that theory. Okay. So this is a really interesting insight which I never really see covered: we can take these traditional machine learning techniques and imbue them with a kind of Bayesian sense by incorporating our theoretical expectations into the data that we give our model. And when we do so, it means we don't have to regularize as much, and that's good, because if we regularize a lot... well, let's try it. Remember, the way they do it in the sklearn LogisticRegression is that this C parameter is the reciprocal of the amount of regularization penalty, so we'll add lots of regularization by making it small.
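Here's a sketch of both steps (toy data standing in for the binarized trigram matrix): scale the features element-wise by r, fit the regularized logistic regression, and then shrink C, which is one over the regularization strength, and watch the coefficients get squashed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression

x = csr_matrix(np.array([[1, 0, 1],
                         [1, 1, 0],
                         [0, 1, 1],
                         [0, 1, 0]]))       # binarized term-document matrix
y = np.array([1, 1, 0, 0])
r = np.array([0.8, 0.3, -1.2])              # log-count ratios from naive Bayes

x_nb = x.multiply(r).tocsr()                # broadcast element-wise multiply by r

for C in (1.0, 0.1, 1e-5):                  # smaller C means a bigger penalty
    m = LogisticRegression(C=C, max_iter=1000).fit(x_nb, y)
    print(C, m.coef_)                       # coefficients shrink towards zero
```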
So that really hurts; it really hurts our accuracy, because now it's trying really hard to get those weights down. The loss function is overwhelmed by the need to reduce the weights, and the need to make it predictive now seems totally unimportant. So by starting out and saying, you know what, don't push the weights down in a way that ends up ignoring the terms, but instead push them down so that you try to ignore differences from our expectation based on the naive Bayes formulation, we end up with a very nice result. This technique was originally presented, I think, around 2012. Chris Manning, who's a terrific NLP researcher up at Stanford, and Sida Wang, who I don't know, but I assume is awesome because the paper is awesome, basically came up with this idea, and what they did was compare it to a number of other approaches on a number of other data sets. One of the things they tried it on is this one, the IMDB data set, and here's naive Bayes SVM on bigrams; and as you can see, this approach outperformed the other linear-based approaches they looked at, and also some restricted Boltzmann machine, kind of neural-net-based, approaches they looked at. Nowadays there are better ways to do this; in fact, in the deep learning course we showed a new state-of-the-art result that we just developed at fast.ai that gets well over 94%. But still, particularly for a linear technique that's easy, fast and intuitive, this is pretty good. And you'll notice that when they did this, they only used bigrams; I assume that's because I looked at their code and it was kind of slow and ugly. I figured out a way to optimize it a lot more, as you saw, so we were able to use trigrams here, and so we get quite a lot better: 91.8 versus their 91.2. But other than that, it's identical. Also, they used a support vector machine, which is almost identical to a logistic regression in this case, so there are some minor differences. So I think that's a pretty cool result. And I will mention that what you get to see here in class is the result of many weeks, and often many months, of research that I do, so I don't want you to think this stuff is obvious; it's not at all clear from reading this paper. There's no description in the paper of why they use this model, how it's different, or why they thought it works. It took me a week or two to even realize that it's mathematically equivalent to a normal logistic regression, and then a few more weeks to realize that the difference is actually in the regularization. This is what machine learning is like, as I'm sure you've noticed from the Kaggle competitions you enter: you come up with a thousand good ideas, and 999 of them, no matter how confident you are they're going to be great, always turn out to be shit. And then finally, after four weeks, one of them works and gives you the enthusiasm to spend another four weeks of misery and frustration. This is the norm. And for sure, the best practitioners I know in machine learning all share one particular trait in common, which is that they're very, very tenacious, also known as stubborn and bloody-minded, right?
Which is definitely a reputation I seem to have, probably fairly, along with another thing, which is that they're all very good coders; they're very good at turning their ideas into code. So this was a really interesting experience for me, working through this a few months ago to try to figure out how to explain why this, at the time, state-of-the-art result exists. And once I figured that out, I was actually able to build on top of it and make it quite a bit better. I'll show you what I did, and this is where it was very, very handy to have PyTorch at my disposal, because I was able to create something that was customized just the way I wanted it to be, and also very fast, by using the GPU. So here's the kind of fast.ai version of the NBSVM. Actually, my friend Stephen Merity, who's a terrific researcher in NLP, has christened this the NBSVM++, which I thought was lovely; even though there is no SVM, it's a logistic regression, but as I said, that's nearly exactly the same thing. So let me first show you the code. Once I'd figured out what I thought was the best way to do a linear bag-of-words model, I embedded it into fast.ai so you can just write a couple of lines of code. The code is basically: hey, I want to create a data class for text classification; I want to create it from a bag of words; here is my bag of words, here are my labels, here is the same thing for the validation set; and use up to 2,000 unique words per review, which is plenty. Then from that model data, construct a learner, which is the fast.ai generalization of a model, based on a dot product version of naive Bayes; then fit that model for a few epochs. And after five epochs I was already up to 92.2, so this is now getting quite well above that linear baseline. So let me show you the code for that. The code is horrifyingly short; that's it, and it'll also look, on the whole, extremely familiar. There are a few tweaks here: pretend this thing that says Embedding actually says Linear; I'm going to show you embedding in a moment, but for now pretend it says Linear. So we've basically got a linear layer where the number of features is the number of rows, and remember sklearn "features" basically means number of words; and for each word we're going to create one weight, which makes sense: for a logistic regression, each word has one weight. And then we're going to multiply it by the r values, and for each word we have one r value per class. I actually made this so it can handle not just positive versus negative, but, say, figuring out which author created this work; there could be five or six authors, whatever. And basically we use those layers to get the value of the weight and the value of r, then take the weight times r and sum it up, and that's just a dot product, just as we would do for any logistic regression; and then do the softmax. So the very minor tweak we add to get the better result, the main one really, is this here: this "plus something".
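Since the notebook itself isn't reproduced here, here's a rough reconstruction of what that module looks like, pieced together from the description above rather than copied from the fastai library; in the real version the r embedding would be filled in with the precomputed naive Bayes log-count ratios rather than left random.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DotProdNB(nn.Module):
    def __init__(self, nf, ny, w_adj=0.4):
        super().__init__()
        self.w_adj = w_adj
        self.w = nn.Embedding(nf + 1, 1, padding_idx=0)   # one learned weight per word
        self.w.weight.data.uniform_(-0.1, 0.1)
        self.r = nn.Embedding(nf + 1, ny)                  # one r value per word per class
                                                           # (would be set to the NB ratios)
    def forward(self, feat_idx):
        w = self.w(feat_idx)                               # (batch, words, 1)
        r = self.r(feat_idx)                               # (batch, words, ny)
        x = ((w + self.w_adj) * r).sum(1)                  # the "plus something", then a dot product
        return F.softmax(x, dim=1)

m = DotProdNB(nf=5, ny=2)
idx = torch.tensor([[1, 2, 5, 0]])                         # word indexes in one document, 0 = padding
print(m(idx))                                              # probabilities for each class
```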
And the thing I'm adding is a parameter, but I pretty much always use this value, 0.4. So what does this do? What it's doing is, again, kind of changing the prior. If you think about it, even once we use this r times the term-document matrix as our independent variables, the penalty terms are still pushing w down towards 0. So what would it mean for w to be 0? What would it mean if we had coefficients of 0, 0, 0, 0? Well, when we go "this matrix times these coefficients", we still get 0; so a weight of 0 still ends up saying, I have no opinion on whether this thing is positive or negative. On the other hand, if they were all 1, then it basically says, my opinion is that the naive Bayes coefficients are exactly right. And so the idea is that 0 is almost certainly not the right prior; we shouldn't really be saying that having no coefficient means "ignore the naive Bayes coefficient". And 1 is probably too high, because we actually think naive Bayes is only part of the answer. So I played around with a few different data sets where I basically said: take the weights and add some constant to them, so 0 would become, in this case, 0.4; in other words, the regularization penalty is pushing the weights not towards 0, but towards this value. And I found that across a number of data sets, 0.4 works pretty well, and it's pretty resilient. So again, the basic idea is to get the best of both worlds: we're learning from the data using a simple model, but we're incorporating our prior knowledge as best we can. And it turns out that when you say, okay, a weight matrix of zeros actually means you should use about half of the r values, that ends up working better than the prior that the weights should all be 0. Yes? Student: is the w, the 0.4, the amount of regularization required? We have the term where we reduce the prediction error, the RMSE, plus we have the regularization; does the 0.4 denote the amount of regularization required? So, w are the weights, and this is calculating our activations: we calculate our activations as being equal to the weights, plus 0.4, times r, summed up; that's just our normal linear function. The thing which is being penalized is my weight matrix; that's what gets penalized. So by saying, don't just use w, use w plus 0.4, that 0.4 is not being penalized; it's not part of the weight matrix. So, effectively, the weight matrix gets 0.4 for free. Student: so by doing this, even after regularization, every feature is getting some form of minimum weight or something like that? Not necessarily, because it could end up choosing a coefficient of negative 0.4 for a feature, and that would say: even though naive Bayes says r should be whatever for this feature, I think you should totally ignore it. Yeah, great questions. Okay. We started at twenty past two, so let's take a break for about eight minutes or so and start back at about twenty-five to four. Okay. So, a couple of questions at the break.
The first was just for a reminder, or a bit of a summary, of what's going on here. So here we have w, plus a weight adjustment, times r. Normally what we were doing was saying, hey, logistic regression is basically w times x; I'm going to ignore the bias. And then we changed it to be w times (x element-wise times r), and we were kind of doing that x-times-r bit first. Although in this particular code, now that I look at it, I'm actually doing the other bit first, and this thing here I called w, which is probably pretty bad, because it's actually w times x. So instead of w times x times r, I've got w times x, plus a constant, times r. The key idea here is that regularization (I can't draw in yellow, fair enough) wants the weights to be zero, because it's trying to reduce that penalty. And what we're saying is: okay, we want to push the weights towards zero, because that's our default starting-point expectation; and so we want to be in a situation where, if the weights are zero, we have a model that makes theoretical or intuitive sense to us. This model, if the weights are zero, doesn't make intuitive sense, because it's saying: hey, multiply everything by zero, which gets rid of all of that, and gets rid of the r as well. And we were actually saying, no, we actually think r is useful, we want to keep it. So instead we say: let's take that piece here and add 0.4 to it. So now, if the regularizer is pushing the weights towards zero, then it's pushing the value of this sum towards 0.4, and therefore it's pushing the whole model towards 0.4 times r. In other words, our default starting point, if you've regularized all the weights away altogether, is to say: yeah, let's use a bit of r, that's probably a good idea. So that's the idea: ask what happens when that's zero, and you want that to be something sensible, because otherwise regularizing the weights to move in that direction wouldn't be such a good idea. The second question was about n-grams. The n in n-gram can be uni, bi, tri, whatever: one, two, three, whatever, grams. "This movie is good" has four unigrams: this, movie, is, good. It has three bigrams: this movie, movie is, is good. And it has two trigrams: this movie is, movie is good. Okay? Can you pass that? Student: do you mind going back to the w_adj stuff, the 0.4 stuff? I was wondering if this adjustment will harm the predictive power of the model, because think of the extreme case: if it's not 0.4 but 4,000, then all the coefficients will essentially be the same. So, exactly: our prior needs to make sense. And our prior here, and this is why it's called dot_prod_nb, is that this is something where we think naive Bayes gives a good prior. Naive Bayes says that r equals p over q (sorry, that's not how you write a p, I have not had much sleep); p over q is a good prior. And not only do we think it's a good prior, we think r times x plus b is a good model. That's the naive Bayes model.
So in other words, we expect that a coefficient of 1 is a good coefficient, not 4,000. Specifically, we think 0 is probably not a good coefficient, but we also think that maybe the naive Bayes version is a little overconfident, so maybe 1 is a little high. So we're pretty sure that the right number, assuming our naive Bayes model is appropriate, is somewhere between 0 and 1. Student: but what I was thinking is that as long as it's not 0, you are pushing the coefficients that are supposed to be 0 to something that's not 0, and making the high coefficients less distinctive from the other coefficients. Well, but you see, they're not supposed to be 0; they're supposed to be r. That's what they're supposed to be. And remember, this is inside our forward function, so it's part of what we're taking the gradient of. So you can still set self.w to anything you like; it's just that the regularizer wants it to be 0, and all we're saying is: okay, if you want it to be 0, then I'll try to make 0 give a sensible answer. That's the basic idea. And nothing says 0.4 is perfect for every data set. I've tried a few different data sets and found various numbers between 0.3 and 0.6 that are optimal, but I've never found one where 0.4 is less good than 0, which is not surprising, and I've also never found one where 1 is better. So the idea is that this is a reasonable default, but it's another parameter you can play with, which I kind of like; it's another thing you could use grid search or whatever to figure out from your data set. And really the key here is that every model before this one, as far as I know, has implicitly assumed it should be 0, because they just don't have this parameter. And, by the way, I've actually got a second parameter here as well, which is that I also divide r by a parameter; I'm not going to worry too much about it now, but again, it's another thing you can use to adjust the nature of the regularization. In the end, I'm an empiricist, not a theoretician: I thought this seemed like a good idea, and nearly all of my things that seem like a good idea turn out to be stupid, but this particular one gave good results on this data set and a few other ones as well. Okay, could you pass that? We'll start there. Student: I'm still a little bit confused about the w plus w_adj. You mentioned that we do w plus w_adj so that the coefficients don't get set to 0, so that we place some importance on the prior. But you also said that the effect of learning can be that w gets set to a negative value, which effectively turns w plus w_adj into 0. So we are allowing the learning process to set the prior's contribution to 0 anyway; why is that in any way different from just having w? Yeah, great question: because of regularization, because we're penalizing it with that penalty term. In other words, we're saying: if the best thing to do is to ignore the value of r, that'll cost you; you're going to have to set w to a negative number. So only do that if it's clearly a good idea; unless it's clearly a good idea, you should leave it where it is. That's the only reason.
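A small numeric check of that point (my own sketch, with made-up numbers): with the adjustment, regularizing every weight all the way to zero no longer means "ignore every feature"; it means "use 0.4 of the naive Bayes model".

```python
import numpy as np

x = np.array([[1, 0, 1],
              [0, 1, 1]], dtype=float)     # binarized bag-of-words rows
r = np.array([0.8, 0.3, -1.2])             # hypothetical log-count ratios
w = np.zeros(3)                            # what the regularizer is pushing towards
w_adj = 0.4

print(x @ (w * r))                         # without the tweak: [0. 0.], no opinion at all
print(x @ ((w + w_adj) * r))               # with the tweak: 0.4 times the naive Bayes model
print(0.4 * (x @ r))                       # same numbers
```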
All of this stuff we've done today is basically entirely about maximizing the advantage we get from regularization, and saying that regularization pushes us towards some default assumption. Nearly all of the machine learning literature assumes that default assumption is "everything's zero", and I'm saying that it makes sense theoretically, and turns out empirically, that actually you should decide what your default assumption is, and that will give you better results. Student: so would it be right to say that, in a way, you're putting an additional hurdle along the way towards getting all the coefficients to zero, so it will only do that if it's really worth it? Yeah, exactly. I'd say the default hurdle, without this, is that making a coefficient non-zero is the hurdle; and now I'm saying, no, the hurdle is making a coefficient not be equal to 0.4 times r. Student: so this is the sum of w squared times some lambda or C penalty constant, so the weight decay should also depend on the value of that constant; if it is very small... Sorry, do you mean this, a? Student: so if a is 0.1, then the weights might not go towards zero, and then we might not need weight decay. Well, whatever this value is: if it's zero, then there is no regularization at all; but if this value is higher than zero, then there is some penalty, and presumably we've set it to non-zero because we're overfitting, so we want some penalty. And if there is some penalty, then my assertion is that we should penalize things that are different from our prior, not things that are different from zero; and our prior is that things should be round about equal to r. Okay, let's move on. Thanks for the great questions. I want to talk about embedding. I said "pretend it's linear", and indeed we can pretend it's linear; let me show you how much we can pretend it's linear, as in nn.Linear, create a linear layer. Here is our data matrix, and here are our coefficients; if we're doing the r version, here are our coefficients r. If we were to put those into a column vector, then we could do a matrix multiply of that by that: here's our matrix, here's our vector, and we end up with one times this coefficient, plus one times that coefficient, plus zero times that one, and so forth, and then the same for the next row. So that matrix multiply of this independent variable matrix by this coefficient matrix is going to give us an answer; it's just a matrix multiply. So the question is: why didn't Jeremy write nn.Linear? Why did Jeremy write nn.Embedding? And the reason is that, if you recall, we don't actually store the matrix like this, because it's actually of width 800,000 and height 25,000. Rather than storing it like that, we actually store it as 0, 1, 2, 3; 1, 2, 3, 4; 0, 1, 2, 5; 1, 2, 4, 5: this bag of words contains these word indexes. Does that make sense? That's a sparse way of storing it: just list out the indexes that appear in each sentence. So given that, I now want to do that matrix multiply I just showed you, to create that same outcome, but I want to do it from this representation.
So if you think about it, all this is actually doing is saying: this is basically one-hot encoded, it's like a dummy-matrix version. Does it have the word "this", does it have the word "movie", does it have the word "is", and so forth. So if we take the simple version, "does it have the word this", that's 1, 0, 0, 0; and if we multiply that by the coefficient matrix, that's just going to return the first item. Does that make sense? So in general, a one-hot encoded vector times a matrix is identical to looking up that matrix and finding the nth row in it. So this is identical to saying: find the zeroth, first, second and fifth coefficients. They're exactly the same thing. Now in this case I only have one coefficient per feature, but actually the way I did this was to have one coefficient per feature for each class. So in this case it's both positive and negative, so I actually had kind of an r-positive and an r-negative; r-negative would just be the opposite, that divided by that. In the binary case it's obviously redundant to have both, but what if it was: who's the author of this text, is it Jeremy or Savannah or Terence? Now we've got three categories and we want three values of r. So the nice thing is that in this sparse version, you can just look up the zeroth and the first and the second and the fifth, and again it's identical, mathematically identical, to multiplying by a one-hot encoded matrix; but when you have sparse inputs, it's obviously much, much more efficient. So this computational trick, which is mathematically identical to, not conceptually analogous to, multiplying by a one-hot encoded matrix, is called an embedding. I'm sure most of you have heard about embeddings, like word embeddings, word2vec or GloVe or whatever, and people love to make them sound like they're some amazing, new, complex neural net thing. They're not. Embedding means: make a multiplication by a one-hot encoded matrix faster by replacing it with a simple array lookup. So that's why I said you can think of this as if it said self.w equals nn.Linear(nf+1, 1), because it actually does the same thing: it actually is a matrix with those dimensions, it's a linear layer. But it's expecting that the input we're going to give it is not a one-hot encoded matrix, but a list of integers, the indexes for each word or each item. You can see that the forward function in fast.ai automatically gets, for this learner, the feature indexes; they come from the sparse matrix automatically, and NumPy makes it very easy to just grab those indexes. So here we've got a list of each word index, of the 800,000, that appears in this document; and then this here says: look up each of those in our embedding matrix, which has 800,000 rows, and return each thing that you find. Mathematically identical to multiplying by the one-hot encoded matrix. Does that make sense? So that's all an embedding is. And what that means is we can now handle building any kind of model, whatever kind of neural network, where we have potentially very high cardinality categorical variables as our inputs.
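Here's a short check of that claim (a sketch, with a made-up six-word vocabulary): the one-hot matrix product and the embedding lookup give exactly the same numbers.

```python
import torch
import torch.nn as nn

emb = nn.Embedding(6, 3)                        # 6 "words", 3 values per word
one_hot = torch.zeros(6)
one_hot[2] = 1.0                                # one-hot vector for word index 2

by_matmul = one_hot @ emb.weight                # multiply by the one-hot encoded vector
by_lookup = emb(torch.tensor([2]))[0]           # simple array lookup of row 2
print(torch.allclose(by_matmul, by_lookup))     # True: mathematically identical
```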
So we can then just turn those categorical variables into a numeric code between zero and the number of levels, and then we can learn a linear layer from that, as if we had one-hot encoded it, without ever actually constructing the one-hot encoded version and without ever actually doing that matrix multiply. Instead, we just store the index version and simply do the array lookup. And the gradients that flow back: in the one-hot encoded version, everything that was a zero has no gradient, so the gradients flowing back just go to update the particular rows of the embedding matrix that we used. That's fundamentally important for NLP, just like here: I wanted to create a PyTorch model that would implement this ridiculously simple little equation, and to do it without this trick would have meant feeding in a 25,000 by 800,000 element array, which would have been kind of crazy. So this trick allowed me to just replace the word Linear with Embedding, replace the thing that feeds in the one-hot encodings with something that just feeds in the indexes, and that was it; it kept working, and it now trains in about a minute per epoch. Okay, so what we can now do is take this idea and apply it not just to language, but to anything: for example, predicting the sales of items at a grocery. Yes, can you pass that? Student: just a quick question. So we are not actually looking up anything, right? We are just saying that the array with the indices is now the representation. Well, we are doing a lookup: the representation that's being stored for the bag of words is now not 1, 1, 1, 0, 0, 1 but 0, 1, 2, 5; and then we still have to do our matrix product, but rather than doing the matrix product, we look up the zeroth thing and the first thing and the second thing and the fifth thing. Student: so that means we are still retaining the one-hot encoded matrix? No, we didn't; there's no one-hot encoded matrix used here. Here's the one-hot encoded matrix, which is not currently highlighted; what we've highlighted is the list of indexes and the list of coefficients from the weight matrix. Does that make sense? Yes. Okay. So what we're going to do now is go a step further and say: let's not use a linear model at all, let's use a multi-layer neural network. And let's have the input to that potentially include some categorical variables, and those categorical variables we will just have as numeric indexes. So the first layer for those won't be a normal linear layer; it will be an embedding layer, which we know behaves exactly like a linear layer mathematically. And then our hope is that we can use this to create a neural network for any kind of data. So there was a competition on Kaggle a few years ago called Rossmann, which is a German grocery chain, where they asked you to predict the sales of items in their stores, and that included a mixture of categorical and continuous variables. And in this paper by Guo and Berkhahn, they described their third-place winning entry, which was much simpler than the first-place winning entry but nearly as good, and much, much simpler, because they took advantage of this idea of what they call entity embeddings. In the paper they thought, I think, that they had invented this.
Actually, it had been written about earlier by Yoshua Bengio and his co-authors, in another Kaggle competition, which was predicting taxi destinations; although I will say I feel like Guo went a lot further in describing how this can be used in many other ways, and we'll talk about that as well. So this notebook is actually in the deep learning repo, dl1, lesson 3, because we talk about some of the deep-learning-specific aspects in the deep learning course, whereas in this course we're going to be talking mainly about the feature engineering, and we're also going to be talking about this embedding idea. So let's start with the data. The data was: store number 1, on the 31st of July 2015, was open; they had a promotion going on; it was a school holiday; it was not a state holiday; and they sold 5,263 items. That's the key data they provided, and the goal is obviously to predict sales in a test set that has the same information without the sales. They also tell you that each store is of some particular type, it sells some particular assortment of goods, its nearest competitor is some distance away, the competitor opened in September 2008, and there's some more information about promos; I don't know the details of what that means. Like in many Kaggle competitions, they let you download external data sets if you wish, as long as you share them with other competitors. They also told you what state each store is in, so people downloaded a list of the names of the different states of Germany; they downloaded, for each state in Germany, for each week, some kind of Google Trends data (I don't know which specific trend they got, but there was that); and for each date they downloaded a whole bunch of temperature information. That's it. And then here's the test set. One interesting insight here is that it was probably a mistake, in some ways, for Rossmann to design this competition as one where you could use external data, because in reality you don't actually get to find out next week's weather or next week's Google Trends. But when you're competing on Kaggle, you don't care about that; you just want to win, so you use whatever you can get. Let's talk first of all about data cleaning. There wasn't really much feature engineering done in this third-place winning entry, particularly by Kaggle standards, where normally every last thing counts. This is a great example of how far you can get with a neural net. It certainly reminds me of the claims prediction competition we talked about yesterday, where the winner did no feature engineering and relied entirely on deep learning. The laughter in the room, I guess, is from people who did a little bit more than no feature engineering in that competition. I should mention, by the way, that I find that bit where you work hard at a competition, and then it closes, and you didn't win, and the winner comes out and says "this is how I won", is the bit where you learn the most. Sometimes that's happened to me and it's been like: oh, I thought of that, I thought I tried that; and then I go back and realize I had a bug there and didn't test properly, and I learn, okay, I really need to learn to test this thing in this different way. Sometimes it's like: oh, I thought of that, but I assumed it wouldn't work, so I've really got to remember to check everything before I make any assumptions. And sometimes it's just: oh, I did not think of that technique.
Wow, now I know it's better than everything I just tried. Because otherwise, when somebody just says, hey, here's a really good technique, you're like, okay, great; but when you've spent months trying to do something and somebody else did it better by using that technique, that's pretty convincing. And so it's kind of hard: I'm standing up in front of you saying, here's a bunch of techniques that I've used, and I've won some Kaggle competitions, and I've got some state-of-the-art results, but that's second-hand information by the time it hits you, right? So it's really great to try things out. And also it's been nice to see, particularly in the deep learning course, quite a few of my students, when I've said this technique works really well, have tried it and got into the top 10 of a Kaggle competition the next day, and they're like, okay, that counts as working really well. So yeah, Kaggle competitions are helpful for lots and lots of reasons, but one of the best is what happens after they finish. And so definitely, for the ones that are now finishing up, make sure you watch the forums and see what people are sharing in terms of their solutions. And if you want to learn more about them, feel free to ask the winners, hey, could you tell me more about this? People are normally pretty good about explaining. And then ideally try and replicate it yourself, right? That can turn into a great blog post, or a great kernel: to be able to say, such and such said that they used this technique, here's a really short explanation of what that technique is, here's a little bit of code showing how it's implemented, and here's the result showing that you can get the same result. That can be a really interesting write-up as well. Okay, so it's always nice to have your data be as easy to understand as possible. In this case, the data that came from Kaggle used various integers for the holidays; we can just use a Boolean of whether it was a holiday or not. So let's just clean that up. We've got quite a few different tables, and we need to join them all together. I have a standard way of joining things together with pandas: I just use the pandas merge function, and specifically, I always do a left join. So who wants to tell me what a left join is? Since it's there, why don't you go ahead. You retain all the rows in the left table; you have a key column, you match that with the key column in the right-hand table, and you merge in the rows that are also present in the right-hand table. Yeah, that's a great explanation. Good job. I don't have much to add to that. The key reason that I always do a left join is that after I do the join, I always then check whether there were things in the right-hand side that are now null, right? Because if so, it means that I missed some things. I haven't shown it here, but I also check that the number of rows hasn't changed before and after; if it has, that means that the right-hand side table wasn't unique. So even when I'm sure something's true, I always also assume that I've screwed it up, so I always check. So I can go ahead and merge the state names into the weather.
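To make that join-and-check habit concrete, here's a rough sketch of the pattern; the table names, column names, and the join_df helper are just illustrative, not necessarily what the actual notebook uses.

```python
import pandas as pd

def join_df(left, right, left_on, right_on=None):
    # Left join that keeps the left-hand column names unchanged and tags
    # clashing right-hand columns with "_y" (names here are illustrative).
    if right_on is None:
        right_on = left_on
    return left.merge(right, how='left', left_on=left_on, right_on=right_on,
                      suffixes=('', '_y'))

stores = pd.DataFrame({'Store': [1, 2, 3], 'Sales': [5263, 6064, 8314]})
store_info = pd.DataFrame({'Store': [1, 2], 'StoreType': ['c', 'a']})

joined = join_df(stores, store_info, 'Store')

# After every left join: check for nulls coming from the right-hand side...
print(joined['StoreType'].isnull().sum())    # > 0 means some keys were missing
# ...and check the row count didn't grow (if it did, the right side wasn't unique).
assert len(joined) == len(stores)
```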
Also, if you look at the Google Trends table, it's got this week range, which I need to turn into a date in order to join it, right? And the nice thing about doing this in pandas is that pandas gives us access to all of Python. So, for example, inside the series object is a .str attribute that gives you access to all the string processing functions, just like .cat gives you access to the categorical functions and .dt gives you access to the datetime functions. So I can now split everything in that column. And it's really important to try and use these pandas functions, because they're going to be vectorized and accelerated, often through SIMD, or at least through C code, so that runs nice and quickly, right? And then, as per usual, let's add date metadata to our dates. In the end, we're basically denormalizing all these tables; we're going to put them all into one table. So in the Google Trends table, there were mainly trends by state, but there were also trends for the whole of Germany, so we put the whole-of-Germany ones into a separate data frame so that we can join that too. So we're going to have the Google Trend for this date and the Google Trend for the whole of Germany. And so now we can go ahead and start joining, both for the training set and for the test set, and in both cases check that we don't end up with any nulls. In my merge function, I set the suffixes: if there are two columns that are the same, I set the suffix on the left to be nothing at all, so it doesn't screw around with the name, and the right-hand side to be underscore y. And in this case, I didn't want any of the duplicate ones, so I just went through and deleted them. Then, in a moment, we're going to want to know how long the competition has been open: the main competitor for this store has been open since some date. And so you can just use pandas to_datetime, passing in the year, the month, and the day. That's going to give us an error unless they all have years and months, so we're going to fill in the missing ones with 1900 and 01. And then what we really want to know is how long the competitor has been open for at the time of this particular record, so we can just do a date subtraction. Now, if you think about it, sometimes the competition opened later than this particular row, so sometimes it's going to be negative, and it probably doesn't make sense to have negatives meaning it's going to open in X days' time. Having said that, I would never put in something like this without first running a model with it in and without it in, right? Because our assumptions about the data very often turn out not to be true. Now, in this case, I didn't invent any of these pre-processing steps. I wrote all the code, but it's all based on the third place winner's GitHub repo, right? So knowing what it takes to get third place in a Kaggle competition, I'm pretty sure they would have checked every one of these pre-processing steps and made sure it actually improved their validation set score. Okay. So what we're going to be doing is creating a neural network where some of the inputs are continuous and some of them are categorical. And so what that means is, in the neural net, we're basically going to have this kind of initial weight matrix, right? And we're going to have this input feature vector, right? Some of the inputs are just going to be plain continuous numbers, like what's the maximum temperature here or what's the number of kilometers to the nearest competitor. And some of them are effectively going to be one-hot encoded, but we're not actually going to store them as one-hot encoded; we're going to store them as the index.
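Going back to those date steps for a moment, here's roughly what they look like in pandas. The column names are stand-ins for the real ones, so treat it as a sketch of the idea rather than the winners' code.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2015-07-31', '2015-08-01']),
    'week_range': ['2015-08-02 - 2015-08-08', '2015-08-09 - 2015-08-15'],
    'CompetitionOpenSinceYear': [2008, None],
    'CompetitionOpenSinceMonth': [9, None],
})

# .str gives vectorized string ops, just as .dt gives datetime ops.
df['week_start'] = pd.to_datetime(df['week_range'].str.split(' - ', expand=True)[0])

# Build a date from the year/month columns, filling missing values with 1900-01
# so to_datetime doesn't blow up on NaNs.
df['CompetitionOpenSince'] = pd.to_datetime(dict(
    year=df['CompetitionOpenSinceYear'].fillna(1900).astype(int),
    month=df['CompetitionOpenSinceMonth'].fillna(1).astype(int),
    day=1))

# Days the competitor has been open at the time of each row; negative values
# (competitor not open yet) get clipped to zero.
df['CompetitionDaysOpen'] = (df['Date'] - df['CompetitionOpenSince']).dt.days.clip(lower=0)
print(df[['week_start', 'CompetitionDaysOpen']])
```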
So for those categorical columns, we just store the index. The neural net model is going to need to know which of these columns it should basically create an embedding for, which ones it should treat as if they were kind of one-hot encoded, and which ones it should just feed directly into the linear layer, right? And so we're going to tell the model, when we get there, which is which. But we actually need to think ahead of time about which ones we want to treat as categorical and which ones as continuous. In particular, for things that we're going to treat as categorical, we don't want to create more categories than we need, right? And let me show you what I mean. The third place getters in this competition decided that the number of months that the competition had been open was something they were going to use as a categorical variable, right? And so in order to avoid having more categories than they needed, they truncated it at 24 months: anything more than 24 months gets truncated to 24. So here are the unique values of competition months open, and it's all the numbers from 0 to 24, right? So what that means is that there's going to be an embedding matrix that has, basically, an embedding vector for things that aren't open yet, for things that have been open one month, for things that have been open two months, and so forth. Now, they absolutely could have done that as a continuous variable. They could have just had a single number here, how many months it had been open, treated it as continuous, and fed it straight into the initial weight matrix. What I've found, though, and obviously what these competitors found, is that where possible, it's best to treat things as categorical variables. And the reason is that when you feed something through an embedding matrix, every level can be treated totally differently, right? So, for example, in this case, whether something's been open for zero months or one month is really different. If you fed that in as a continuous variable, it would be kind of difficult for the neural net to find a functional form that captures that big difference. It's possible, because neural nets can do anything, right? But you're not making it easy for it. Whereas if you used an embedding and treated it as categorical, it'll have a totally different vector for zero versus one, right? So it seems like, particularly as long as you've got enough data, treating columns as categorical variables where possible is a better idea. And when I say where possible, that basically means where the cardinality is not too high. So if this was, say, a sales ID number that was uniquely different on every row, you can't treat that as a categorical variable, because it would be a huge embedding matrix and everything only appears once. Or something like kilometers away from the nearest store to two decimal places: you wouldn't make that a categorical variable either. So that's the rule of thumb that they used in this competition. In fact, if we scroll down to their choices, here is how they did it. Their continuous variables were things that were genuinely continuous, like the number of kilometers away to the competitor, the temperature stuff, the specific number in the Google Trend. Whereas everything else, basically, they treated as categorical.
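To show the truncation trick in code, here's a small sketch; the column names and the particular categorical and continuous variable lists are illustrative rather than lifted from the winners' repo.

```python
import pandas as pd

df = pd.DataFrame({'CompetitionDaysOpen': [0, 45, 400, 3000]})

# Convert days to whole months and cap at 24, so the categorical variable
# only ever has the 25 levels 0..24 (one embedding row per level).
df['CompetitionMonthsOpen'] = (df['CompetitionDaysOpen'] // 30).clip(upper=24)
print(sorted(df['CompetitionMonthsOpen'].unique()))   # e.g. [0, 1, 13, 24]

# Then each column gets declared as categorical (goes through an embedding)
# or continuous (fed straight into the first linear layer).
cat_vars = ['Store', 'DayOfWeek', 'StateHoliday', 'CompetitionMonthsOpen']
contin_vars = ['CompetitionDistance', 'Max_TemperatureC', 'trend']
```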
Okay, so that's it for today. Next time we'll finish this off: we'll see how to turn this into a neural network and, yeah, kind of wrap things up. So, see you then.