Okay, hey everybody, and welcome to Practical Deep Learning for Coders, lesson 5. We're at a stage now where we're going to be getting deeper and deeper into the details of how these networks actually work. Last week we saw how to use a slightly lower-level library than fastai, namely Hugging Face Transformers, to train a pretty nice NLP model, and today we're going to go back to tabular data and try to build a tabular model from scratch. In fact, we're going to build a couple of different types of tabular model from scratch.

The problem I'm going to work through is the Titanic problem which, if you remember back a couple of weeks, is the dataset we looked at in Microsoft Excel. Each row is one passenger on the Titanic, so this is a real-world, historic dataset. It tells you whether that passenger survived, what class they were in on the ship, their sex, age, how many siblings and how many other family members they had, how much they spent on the fare, and whereabouts they embarked (one of three different cities). You might remember that we built a linear model, we then did the same thing using matrix multiplication, and we also created a very, very simple neural network. Excel can do nearly everything we need, as you saw, to build a neural network, but it starts to get unwieldy, and that's why people don't use Excel for neural networks in practice; instead we use a programming language like Python. So what we're going to do today is the same thing, with Python.

We're going to start working through the "Linear model and neural net from scratch" notebook, which you can find on Kaggle or in the course repository. Today we're going to work through the one in the clean folder. For both fastbook (the book) and course22 (these lessons), the clean folder contains all of our notebooks, but without any prose or any outputs. So here's what it looks like when I open up "Linear model and neural net from scratch" in Jupyter.

What I'm using here is Paperspace Gradient which, as I mentioned a couple of weeks ago, is what I'm going to be doing most things in. It looks a little bit different to the normal Paperspace Gradient, because the default view for Paperspace Gradient, at least as I record this course, is their rather awkward notebook editor, which at first glance has the same features as the real Jupyter Notebook and JupyterLab environments, but in practice is actually missing lots of things. So from the normal Paperspace view, remember you have to click this button. The only reason you might keep that window running is that you can go over to the machine page to remind yourself, when you close the other tab, to click "stop machine". If you're using the free tier it doesn't matter too much, and also, when I start a machine, I make sure it's set to shut down automatically in case I forget. Other than that, we can stay in this tab, because this is JupyterLab that's running, and you can always switch over to the classic Jupyter notebook if you want to. Given that you've kind of got tabs inside tabs, I normally maximise it at this point. It's really helpful to know the keyboard shortcuts: Ctrl-Shift-[ and Ctrl-Shift-] switch between tabs, and that's one of the key things to know about.
Okay, so I've opened up the clean version of the "Linear model and neural net from scratch" notebook. Remember, when you go back through the video a second time, or through the notebook a second time, this is generally what you want to be doing: go through the clean notebook, and before you run each cell, try to think, "what did Jeremy say? why are we doing this? what output do I expect?" Make sure you get the output you'd expect, and if you're not sure why something is the way it is, try changing it and see what happens. If you're still not sure why something didn't work the way you expected, search the forum to see if anybody's asked that question before, and ask the question on the forum yourself if you're still not sure.

As I think we've mentioned briefly before, I find it really nice to be able to use the same notebook both on Kaggle and off Kaggle, so most of my notebooks start with basically the same cell, which just checks whether we're on Kaggle. Kaggle sets an environment variable, so we can check for it; that way we know if we're on Kaggle. If we are on Kaggle, a notebook that's part of a competition will already have the data downloaded and unzipped for us; otherwise, if I haven't downloaded the data before, I need to download it and unzip it. Kaggle is a pip-installable module, so you type `pip install kaggle`. If you're not sure how to do that, check our deep dive lessons to see the exact steps, but roughly speaking you can use your console (pip install whatever you want to install), or, as we've seen before, you can do it directly in a notebook by putting an exclamation mark at the start, which runs a shell command rather than Python. That's enough to ensure we have the data downloaded and a variable called `path` pointing at it.

Most of the time we're going to be using at least PyTorch and NumPy, so we import those so that they're available to Python, and when we're working with tabular data, as we've talked about before, we're generally also going to want pandas. It's really important that you're somewhat familiar with the basic API of these three libraries, and I've recommended Wes McKinney's book before, particularly for these. One thing, by the way, is that these libraries tend to assume you've got a very narrow screen, which is really annoying because they always wrap things; these three lines just make sure that everything uses the full width of the screen properly.

Okay, so as we've seen before, you can read a comma-separated values file with pandas, take a look at the first few lines and the last few lines, and see how big it is. So here's our data from the spreadsheet, and here it is as a DataFrame. If we call `DataFrame.isna()`, that returns a new DataFrame which, for every value, tells us whether or not that particular value is NaN. NaN is "not a number", and the most common reason you get one is that the value was missing; a missing value is obviously not a number. In the Excel version we did something you should never usually do: we deleted all the rows with missing data, just because in Excel it's a little bit harder to deal with. In pandas it's very easy to deal with.
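For reference, here's roughly what those first setup cells look like. This is a sketch, assuming a Kaggle API token is configured locally; paths and the display-option details may differ slightly from the actual notebook.

```python
import os
from pathlib import Path

# Kaggle sets this environment variable, so its presence tells us where we're running
iskaggle = os.environ.get('KAGGLE_KERNEL_RUN_TYPE', '')

if iskaggle:
    path = Path('../input/titanic')        # competition data is already attached on Kaggle
else:
    path = Path('titanic')
    if not path.exists():                  # download and unzip the data the first time only
        import zipfile, kaggle
        kaggle.api.competition_download_cli(str(path))
        zipfile.ZipFile(f'{path}.zip').extractall(path)

import torch, numpy as np, pandas as pd

# widen the printed output so it doesn't wrap on a normal screen
np.set_printoptions(linewidth=140)
torch.set_printoptions(linewidth=140, sci_mode=False)
pd.set_option('display.width', 140)

df = pd.read_csv(path/'train.csv')
df.isna()      # a same-shaped DataFrame of booleans: True wherever a value is missing
```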
First of all, we can just sum up what I just showed you: if you call `sum()` on a DataFrame, it sums up each column. You can see there are some small foundational concepts in pandas which, when you put them together, take you a long way. One idea is that you can call a method on a DataFrame and it applies to every row, and then you can call a reduction on that and it reduces each column, and now we've got the totals. And in Python, and in pandas and NumPy and PyTorch, you can treat a Boolean as a number: True is one, False is zero. So this is the number of missing values in each column. We can see that Cabin, out of 891 rows, is nearly always empty; Age is empty a bit of the time; and Embarked is almost never empty.

If you remember from Excel, we need to multiply a coefficient by each column; that's how we create a linear model. So how would you multiply a coefficient by a missing value? You can't. There are lots of ways of what's called imputing missing values, that is, replacing a missing value with a number. The easiest, which always works, is to replace missing values with the mode of a column. The mode is the most common value, and that works for both categorical variables (it's the most common category) and continuous variables (it's the most common number). You can get the mode by calling `df.mode()`. One thing that's a bit awkward is that if there's a tie for the mode (more than one thing that's the most common), it's going to return multiple rows, so I need to take the zeroth row. So here is the mode of every column, and we can replace the missing values for Age with 24, the missing values for Cabin with "B96 B98", and Embarked with "S".

I'll just mention in passing: I am not going to describe every single method we call and every single function we use, and that's not because you're an idiot if you don't already know them. Nobody knows them all, but I don't know which particular subset of them you don't know. Let's assume, just to pick a number at random, that the average fast.ai student knows 80% of the functions we call. I could explain every function, in which case 80% of the time I'm wasting your time because you already know it; or I could pick 20% of them at random, in which case I'm still not helping, because most of the time it's not the ones you don't know. My approach is that, for the ones that are pretty common, I'm just not going to mention them at all, on the assumption that you'll Google them. So, for example, if you don't know what `iloc` is, that's not a problem; it doesn't mean you're stupid, it just means you haven't used it yet, and you should Google it. I will mention, in this particular case, that it's one of the most important pandas methods, because it gives you the row located at a given index: i for index, loc for location.
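In code, that's just the following (a sketch):

```python
df.isna().sum()        # number of missing values per column: True counts as 1

# mode() returns a DataFrame (it can have several rows if there's a tie),
# so iloc[0] grabs row zero: one modal value per column
modes = df.mode().iloc[0]
modes
```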
So `df.mode().iloc[0]` gives us row zero of those modes. But yeah, I do go through things a little bit quickly, on the assumption that fast.ai students are proactive, curious people. And if you're not a proactive, curious person, then you could either decide to become one for the purposes of this course, or maybe this course isn't for you.

All right, so a DataFrame has a very convenient method called `fillna`, and that's going to replace the NaNs with whatever I put here. The nice thing about pandas is that it understands that columns match up to columns, so it's going to take the mode from each column, match it to the same column in the DataFrame, and fill in those missing values. Normally that would return a new DataFrame, but many things in pandas, including this one, have an `inplace` argument that says "actually modify the original one". So if I run that and then call `.isna().sum()`, they're all zero. That's about the world's simplest way to get rid of missing values.

Why did we do it the world's simplest way? Because, honestly, it doesn't make much difference most of the time, and I'm not going to spend time, the first time I go through and build a baseline model, doing complicated things when I don't necessarily know that I need complicated things. Imputing missing values is an example of something where, most of the time, this dumb way, which always works without even thinking about it, will be quite good enough for nearly all cases. So we keep things simple where we can.

John: Jeremy, we've got a question on this topic. Javier is commenting on the assumption involved in substituting with the mode, and he's asking: in your experience, what are the pros and cons of doing this versus, for example, discarding Cabin or Age as fields that we even train the model on?

Yeah, so I would certainly never throw them out. There's just no reason to throw away data, and there are lots of reasons not to. For example, when we use the fastai library, which we'll use later, one of the things it does (which is actually a really good idea) is it creates a new column for everything that's got missing values, a Boolean column saying "did this column have a missing value for this row". And maybe it turns out that Cabin being empty is a great predictor. So no, I don't throw out rows and I don't throw out columns.

Okay. It's helpful to understand a bit more about our dataset, and a really helpful quick method (it's nice to know a few quick things you can do to get a picture of what's happening in your data) is `describe`. You can say, OK, describe all the numeric variables, and that gives me a quick sense of what's going on here. We can see Survived is clearly just zeros and ones, because all of the quartiles are zeros and ones; it looks like Pclass is 1, 2, 3. What else do we see? Fare's an interesting one: lots of smallish numbers and one really big number, so it's probably long-tailed. It's good to have a look at this and see what's going on with your numeric variables. As I said, Fare looks kind of interesting, and to find out what's going on there I would generally go with a histogram. If you can't quite remember what a histogram is, again, Google it, but in short it shows you, for each amount of fare, how often that fare appears.
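Here's roughly what those steps look like, as a sketch:

```python
import numpy as np

# replace every missing value, in place, with that column's mode
df.fillna(modes, inplace=True)
df.isna().sum()        # now all zeros

# quick summary of the numeric columns: ranges, quartiles, obvious long tails
df.describe(include=(np.number))

# histogram of Fare to see its distribution
df['Fare'].hist()
```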
And it shows me here that the vast majority of fares are less than $50, but there are a few right up here towards 500. This is what we'd call a long-tailed distribution: a small number of really big values and lots of small ones. There are some types of model which do not like long-tailed distributions; linear models are certainly one of them, and neural nets are generally better behaved without them as well. Luckily, there's an almost sure-fire way to turn a long-tailed distribution into a more reasonably centred one, and that is to take the log. We use logs a lot in machine learning; for those of you who haven't touched them since year 10 maths, it would be a very good time to go to Khan Academy or something and remind yourself what logs are and what they look like, because they're really, really important. The basic shape of the log curve makes really big numbers much less big, and doesn't change really small numbers very much at all.

Now, the log of zero is a problem, so a useful trick is to take the log of the value plus one; in fact there's a `log1p` function that does exactly that, if you want it. If we look at the histogram of that, you can see it's much more sensible now: it's kind of centred, and it doesn't have this big long tail. So that's pretty good; we'll be using that column from now on. As a rule of thumb, stuff like money or population, things that can grow exponentially, you very often want to take the log of. If a column has a dollar sign on it, that's a good sign it might be something to take the log of.

There was another one here, which is that we had a numeric column that doesn't really look numeric at all; it looks like it's actually categories. pandas gives us `.unique()`, and we can see that, yep, 1, 2 and 3 are all the levels of Pclass: first class, second class or third class. We can also describe all the non-numeric variables, and we can see here that, not surprisingly, names are unique (the count of names is the same as the number of unique values), there are two sexes, 681 different tickets, 147 different cabins, and three levels of Embarked.

Now, we cannot multiply the letter "S" by a coefficient, or the word "male" by a coefficient. So what do we do? What we do is create something called dummy variables. We can just call `get_dummies`, and a dummy variable is a column that says, for example: is Sex female, is Sex male, is Pclass 1, is Pclass 2, is Pclass 3. So for every possible level of every possible categorical variable, it's a Boolean column: did that row have that value of that column.
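Roughly, the two transformations look like this (a sketch using the column names from the Titanic CSV):

```python
# log(1 + Fare) tames the long tail; the +1 avoids taking the log of zero
df['LogFare'] = np.log(df['Fare'] + 1)     # np.log1p(df['Fare']) is equivalent
df['LogFare'].hist()

# one Boolean column per level of each categorical variable
df = pd.get_dummies(df, columns=["Sex", "Pclass", "Embarked"])
added_cols = ['Sex_male', 'Sex_female', 'Pclass_1', 'Pclass_2', 'Pclass_3',
              'Embarked_C', 'Embarked_Q', 'Embarked_S']
df[added_cols].head()
```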
I think we've briefly talked about this before: there are a couple of different ways we can do this. One is that, for an n-level categorical variable, we could use n-1 columns, in which case we also need a constant term in our model. pandas by default creates all n levels, although you can pass an argument, drop_first, to change that if you want. I kind of like having all of them sometimes, because then you don't have to put in a constant term, which is a bit less annoying, and it can be a bit easier to interpret; but I don't feel strongly about it either way.

Okay, so here's a list of all the columns that pandas added. Strictly speaking I probably should have automated that, but never mind, I just copied and pasted them. And here are a few examples of the added columns. In Unix, and in pandas, lots of things work like this: `head` means the first few rows, or the first few lines, five by default in pandas. Here you can see they're never both male and female, and they're never neither; they're always one or the other. All right, so with that, we've now got numbers which we can multiply by coefficients.

It's not going to work for Name, obviously, because we would have 891 columns and all of them would be unique, so we'll ignore that for now. That doesn't mean we have to always ignore it. In fact, something I did do on the forum topic is make a list of some nice Titanic notebooks that I found, and quite a few of them really go hard on this Name column. In fact one of them, this one, in what I believe is Chris Deotte's first ever Kaggle notebook (he's now the number-one ranked Kaggle notebooks person in the world, so this was a very good start), got a much better score than any model we're going to create in this course, using only that Name column. Basically he came up with this simple little decision tree by recognising all of the information that's in a name. So we don't have to treat a big string of letters like this as a random big string of letters; we can use our domain expertise to recognise that things like "Mr" have meaning, and that people with the same surname might be in the same family, and actually figure out quite a lot from that. But that's not something I'm going to do; I'll let you look at those notebooks if you're interested in the feature engineering, and I do think they're very interesting, so do check them out. Our focus today is on building a linear model and a neural net from scratch, not on tabular feature engineering, even though that's also a very important subject.

Okay. We talked about how matrix multiplication makes linear models much easier, and the other thing we did, in the next cell, was element-wise multiplication. Both of those things are much easier if we use PyTorch instead of plain Python. We could use NumPy, but I tend to just stick with PyTorch when I can, because it's easier to learn one library than two, so I just do everything in PyTorch; I almost never touch NumPy nowadays. They're both great, and they do everything the other does, except that PyTorch also does differentiation and GPUs. So why not just learn PyTorch?
So, to turn a column into something I can do PyTorch calculations on, I have to turn it into a tensor. A tensor is just what NumPy calls an array; it's what mathematicians would call either a vector or a matrix or, once we get to higher ranks, what mathematicians and physicists just call tensors. In fact, this idea in computer science originally came from a notation developed in the 1950s called APL, which was turned into a programming language in the 1960s by a guy called Ken Iverson, and Ken Iverson said he got the idea from his time doing tensor analysis in physics; so these areas are very related. So we can turn the Survived column into a tensor, and we'll call that tensor our dependent variable; that's the thing we're trying to predict.

Okay, so now we need some independent variables. Our independent variables are Age; SibSp, the number of siblings; Parch, the number of other family members; the LogFare we just created; plus all of those dummy columns we added. We can now grab those values and turn them into a tensor, and we have to make sure they're floats. We want them all to be the same data type, and PyTorch wants things to be floats if you're going to multiply them together. So there we are.

One of the most important attributes of a tensor, probably the most important attribute, is its shape: how many rows does it have and how many columns does it have. The length of the shape is called its rank; that's the number of dimensions, or axes, that it has. So a vector is rank one, a matrix is rank two, a scalar is rank zero, and so forth. I try not to use too much jargon, but there are some pieces of jargon that are really important, because otherwise you'd have to say "the length of the shape" again and again; it's much easier to say "rank", so we'll use that word a lot. A table is a rank-two tensor.

Okay, so we've now got the data in good shape: here are our independent variables, and we've got our dependent variable. So we can now go ahead and do exactly what we did in Excel, which is to multiply our rows of data by coefficients; and remember, to start with, we create random coefficients. We're going to need one coefficient for each column. Now, in Excel we also had a constant, but in our case, because we've got every level in our dummy variables, we don't need a constant. So the number of coefficients we need is equal to the shape of the independent variables at index 1; that's the number of columns. We can now ask PyTorch to give us that many random numbers. They're between zero and one, so if we subtract a half they'll be centred. And there we go.

Before I do that, I set the seed. What that means is: computers in general cannot create truly random numbers; instead, they calculate a sequence of numbers that behave in a random-like way. That's actually good for us, because often in my teaching I like to be able to say, in the prose, "oh look, that was two, now it's three", or whatever, and if I were using truly random numbers I couldn't do that, because it would be different each time. So this makes my results reproducible (if you run it, you'll get the same random numbers as I do) by saying "start the pseudo-random sequence with this number".
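Putting that together, here's a sketch of the setup. The particular seed value is arbitrary, and it assumes the columns created above.

```python
import torch

t_dep = torch.tensor(df.Survived)                       # dependent variable: a rank-1 tensor

indep_cols = ['Age', 'SibSp', 'Parch', 'LogFare'] + added_cols
t_indep = torch.tensor(df[indep_cols].values, dtype=torch.float)
t_indep.shape                                           # torch.Size([891, 12]): rank 2

torch.manual_seed(442)                                  # start the pseudo-random sequence here
n_coeff = t_indep.shape[1]                              # one coefficient per column
coeffs = torch.rand(n_coeff) - 0.5                      # random coefficients centred on zero
```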
I'll mention in passing that a lot of people are very, very into reproducible results; they think it's really important to always set a seed like this. I strongly disagree with that. In my opinion, an important part of understanding your data is understanding how much it varies from run to run. So if I'm not teaching, and not wanting to be able to write things about these pseudo-random numbers, I almost never use a manual seed. Instead, I like to run things a few times and get an intuitive sense of "oh, this is very, very stable", or "oh, this is all over the place". Getting an intuitive understanding of how your data behaves and how your model behaves is really important.

Now, here's one of the coolest lines of code you'll ever see. I know it doesn't look like much, but think about what it's doing: we've multiplied a matrix by a vector. That's pretty interesting. The mathematicians amongst you will know that you can certainly do a matrix-vector product, but that's not what we've done here at all; we've used element-wise multiplication. Normally, if we did the element-wise multiplication of two vectors, it would multiply element one with element one, element two with element two, and so forth, and create a vector of the same size as output. But here we've done a matrix times a vector. How does that work?

This is using the incredibly powerful technique of broadcasting, and broadcasting, again, comes from APL: a notation invented in the 50s, turned into a programming language in the 60s. It has a number of benefits. Basically, what it's going to do is take each coefficient and multiply it, in turn, by every row in our matrix. If you look at the shape of our independent variables and the shape of our coefficients, each of these twelve coefficients can be multiplied by each of the 891 rows in turn. The reason we call it broadcasting is that it's as if the coefficients were broadcast 891 times: as if we had a loop, looping 891 times, doing coefficients times row 0, coefficients times row 1, coefficients times row 2, and so forth, which is exactly what we want.

Why use broadcasting? Obviously the code is much more concise: it looks more like math, rather than clunky programming with lots of boilerplate. That's good. Also, the broadcasting all happens in optimised C code, and if it's being done on a GPU, it's being done in optimised GPU code, so it's going to run very, very fast indeed. This is the trick to how we can use a so-called slow language like Python to build very fast, big models: a single line of code like this can run very quickly, on optimised hardware, over lots and lots of data.

The rules of broadcasting are a little bit subtle and important to know, so I would strongly encourage you to Google "numpy broadcasting rules" and see exactly how they work. But the intuitive understanding, which hopefully you'll get pretty quickly, is that, generally speaking, as long as the last axes match, it will broadcast over those axes. You can broadcast a rank-three thing with a rank-one thing; or, for the simplest version, take the tensor [1, 2, 3] times 2, which broadcasts a scalar over a vector, exactly what you'd expect. It's effectively copying that 2 into each of these spots and multiplying them together, but it doesn't actually use up any memory to do that; it's a kind of virtual copy, if you like.
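Two quick examples of broadcasting in action, as a sketch:

```python
# the length-12 coeffs vector is broadcast across all 891 rows: no Python loop, no extra memory
(t_indep * coeffs).shape       # torch.Size([891, 12])

# the simplest case: a scalar broadcast over a vector
torch.tensor([1, 2, 3]) * 2    # tensor([2, 4, 6])
```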
So this line of code, independents times coefficients, is very, very important, and it's the key step we wanted to take: we now know exactly what happens when we multiply in the coefficients. If you remember back to Excel, we did that product, and then with SUMPRODUCT we added each row up, because that's what a linear model is: the coefficients times the values, added together. So we're now going to need to add those together.

But before we do that: if we did add up this row, you can see that the very first value has a very large magnitude and all the other ones are small. Same with row two, same with row three, same with row four. What's going on? What's going on is that the very first column is Age, and Age is much bigger than any of the other columns. That's not the end of the world, but it's not ideal, because it means that a coefficient of, say, 0.5 times Age means something very different to a coefficient of 0.5 times LogFare. And that means the random coefficients we start with will mean very different things for different columns, which is going to make it really hard to optimise. So we would like all the columns to have about the same range.

What we could do, as we did in Excel, is divide them by the maximum; we did that for Age, and we also did it for Fare (in that case I didn't use the log). We can get the maxima by calling `.max()`, and you can pass in a dimension: do you want the maximum over the rows or over the columns? We want the maximum over the rows, so we pass in dimension zero. Those different parts of the shape are called either axes or dimensions; PyTorch calls them dimensions. That's going to give us the maximum of each column, and if you look at the docs for PyTorch's max function, it'll tell you it returns two things: the actual value of each maximum, and the index of which row it was in. We want the values. So now, thanks to broadcasting, we can just say: take the independent variables and divide them by that vector of values. Again, we've got a matrix and a vector, so this is going to do an element-wise division of each row by this vector, in a very optimised way. If we now look at our normalised independent variables multiplied by the coefficients, you can see they're all pretty similar values, so that's good.

There are lots of different ways of normalising, but the main ones you'll come across are either dividing by the maximum, or subtracting the mean and dividing by the standard deviation. It normally doesn't matter too much; because I'm lazy, I just pick the easier one, and being lazy and picking the easier one is a very good plan, in my opinion.

So now that we can see that multiplying them together is working pretty well, we can add them up, and now we want to add up over the columns, and that gives us predictions. Obviously, just like in Excel when we started out, they're not useful predictions, because they're random coefficients, but they are predictions nonetheless, and here are the first ten of them. So then, remember, we want to use gradient descent to try to make these better, and to do gradient descent we need a loss: the loss is the measure of how good or bad these coefficients are.
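Here's roughly what that normalisation and the first set of (random) predictions look like:

```python
# per-column maxima: dim=0 reduces over the rows; max() returns (values, indices)
vals, indices = t_indep.max(dim=0)
t_indep = t_indep / vals           # broadcast: each row divided element-wise by the maxima

# a linear model: multiply by the coefficients, then sum over the columns
preds = (t_indep * coeffs).sum(axis=1)
preds[:10]                         # random coefficients, so not useful predictions yet
```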
My favourite loss function, as a kind of "don't think about it, just chuck something out there" choice, is the mean absolute error, and here it is: torch absolute value of the error (the difference), then take the mean. Often, for stuff like this, you'll see people use pre-written mean absolute error functions, which is also fine, but I quite like to write it out, because then I can see exactly what's going on: no confusion, no chance of misunderstanding.

So those are all the steps I'm going to need to create coefficients, run a linear model, and get its loss. What I like to do in my notebooks, not just for teaching but all the time, is to do everything step by step manually, and then just copy and paste the steps into a function. So here's my `calc_preds` function, which is exactly what I just did, and here's my `calc_loss` function, exactly what I just did. A lot of people go back and delete all their explorations, or do them in a different notebook, or, if they're working in an IDE, do it in some line-oriented REPL or whatever. But think about the benefits of keeping it here: when you come back to it in six months, you'll see exactly why you did what you did and how you got there, and if you're showing it to your boss or a colleague, they can see exactly what's happening and what each step looks like. I think that's really very helpful indeed. I know not many people code this way, but I feel strongly that it's a huge productivity win for individuals and teams.

Remember from our gradient descent from scratch work that the one bit we don't want to do from scratch is calculating derivatives, because it's just menial and boring. To get PyTorch to do it for us, you have to say which things you want derivatives for, and of course we want them for the coefficients, so we say `requires_grad_`. Remember, very important in PyTorch: if there's an underscore at the end, that's an in-place operation, so this is actually going to change `coeffs`. It also returns them, but it changes them in place. So now we've got exactly the same numbers as before, but with requires_grad turned on.

Now when we calculate our loss, it doesn't do any other calculations, but what it does store is a gradient function: the function PyTorch has remembered it would have to run to work back through those steps and get the gradients. To say "please actually run that backward gradient function", you call `backward()`, and at that point it sticks the coefficients' gradients into a `.grad` attribute. This tells us that if we increased the Age coefficient, the loss would go down; since a negative gradient means increasing the coefficient would decrease the loss, we should therefore do that. That means, if you remember back to the gradient descent from scratch notebook, we need to subtract the gradients times the learning rate from the coefficients. We haven't got any particular ideas yet for how to set the learning rate, so for now I just try a few and find out what works best; in this case I found 0.1 worked pretty well. So I now subtract (again, this is `sub_`, so subtract in place) from the coefficients their gradient times the learning rate, and the loss has gone down from 0.54 to 0.52. That's great. So there is one step. We've now got everything we need to train a linear model, so let's do it.
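As a sketch, those pieces plus one gradient-descent step look like this:

```python
def calc_preds(coeffs, indeps): return (indeps * coeffs).sum(axis=1)

def calc_loss(coeffs, indeps, deps):
    return torch.abs(calc_preds(coeffs, indeps) - deps).mean()   # mean absolute error

coeffs.requires_grad_()                      # trailing underscore: modifies coeffs in place

loss = calc_loss(coeffs, t_indep, t_dep)
loss.backward()                              # fills in coeffs.grad

with torch.no_grad():                        # don't track this update itself
    coeffs.sub_(coeffs.grad * 0.1)           # subtract gradient * learning rate, in place
    coeffs.grad.zero_()
print(calc_loss(coeffs, t_indep, t_dep))     # loss should have gone down a little
```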
Now, as we discussed last week, to see whether your model is any good, it's important to split your data into a training set and a validation set. For the Titanic dataset it's actually pretty much fine to use a random split, because back when my friend Margit and I actually created this competition for Kaggle, many years ago, that's basically what we did, if I remember correctly. So we can split randomly into a training set and a validation set.

We're just going to use fastai for that. It's very easy to do it manually with NumPy or PyTorch, or you can use scikit-learn's train_test_split; I'm using fastai's here partly because it's easier to remember just one way to do things, and this works everywhere, and partly because in the next notebook we're going to see how to do more stuff in fastai, and I want to make sure we have exactly the same split. What it gives us is a list of the indexes of the rows that will be in, for example, the validation set; that's why I call it the validation split. So to create the validation independent variables, you use those to index into the independent variables, and ditto for the dependent variables. And so now we've got training and validation sets for the independent variables, and the same for the dependent variables.

Like I said before, I normally take stuff I've already done in a notebook that seems to be working and put it into functions. So here's the step which actually updates the coefficients; let's chuck that into a function. Then the steps that go calc_loss, backward, update the coefficients and print the loss: chuck those into one function. Just copying and pasting stuff into cells here. And the bit at the very top of the previous section that got the random numbers, minus 0.5, requires_grad_: chuck that into a function too. So now we've got something that initialises coefficients and something that does one epoch by updating the coefficients, and we can put those together into something that trains the model for any number of epochs with some learning rate, by setting the manual seed, initialising the coefficients, doing one epoch in a loop, and then returning the coefficients.

So let's go ahead and run that function. It prints the loss at the end of each epoch, and you can see the loss going down from 0.53 down, down, down, to a bit under 0.3. That's good: we have successfully built and trained a linear model on a real dataset. I mean, it's a Kaggle dataset, but it's important not to underestimate how real Kaggle datasets are; they're real data. This one's a playground dataset, so it's not like anybody actually cares about predicting who survived the Titanic, because we already know. But it has all the same features of real work, like different data types, missing values and normalisation, so it's a good playground.

It would be nice to see which coefficient is attached to each variable, so if we just zip together the independent variable names and the coefficients (we don't need requires_grad any more) and create a dict of that, there we go.
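Here's a sketch of that whole training loop, through to the coefficient dictionary; it mirrors the notebook, but the epoch count and learning rate here are just illustrative.

```python
from fastai.data.transforms import RandomSplitter

trn_split, val_split = RandomSplitter(seed=42)(df)        # random row indexes for each set
trn_indep, val_indep = t_indep[trn_split], t_indep[val_split]
trn_dep,   val_dep   = t_dep[trn_split],   t_dep[val_split]

def update_coeffs(coeffs, lr):
    coeffs.sub_(coeffs.grad * lr)
    coeffs.grad.zero_()

def one_epoch(coeffs, lr):
    loss = calc_loss(coeffs, trn_indep, trn_dep)
    loss.backward()
    with torch.no_grad(): update_coeffs(coeffs, lr)
    print(f"{loss:.3f}", end="; ")

def init_coeffs(): return (torch.rand(n_coeff) - 0.5).requires_grad_()

def train_model(epochs=30, lr=0.01):
    torch.manual_seed(442)
    coeffs = init_coeffs()
    for _ in range(epochs): one_epoch(coeffs, lr)
    return coeffs

coeffs = train_model(18, lr=0.1)                           # values found by trial and error
dict(zip(indep_cols, coeffs.requires_grad_(False)))        # eyeball the coefficient per column
```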
So it looks like older people had less chance of surviving, which makes sense, and males had less chance of surviving, which also makes sense. It's good to eyeball these and check that they seem reasonable.

Now, the metric for this Kaggle competition is not mean absolute error, it's accuracy. Of course, we can't use accuracy as a loss function, because it doesn't really have a sensible gradient, but we should measure accuracy to see how we're doing, because that's going to tell us how we're going against the thing the Kaggle competition cares about. So we can calculate our predictions, and we'll just say, OK, any time the prediction is over 0.5, we'll say that's predicting survival. So that's our predicted survival, this is the actual in the validation set, and if they're the same, then we predicted it correctly. Here's whether we're right or wrong for the first 16 rows; we're right more often than not. If we take the mean of those (remember, True equals one), that's our accuracy. So we're right about 79% of the time. That's not bad. We've successfully created something, from scratch, that actually predicts who survived the Titanic. That's cool. So let's create a function for that, an accuracy function that just does what I showed, and there it is.

Another thing that's maybe a little unusual about my coding (not that uncommon, but unusual) is that I use fewer comments than most people, because all of my code lives in notebooks, and of course the real version of this notebook is full of prose. When I've taken people through a whole journey about what I've built and why, and what the intermediate results are, and checked them along the way, then the function itself, for me, doesn't need extensive comments; I'd rather explain the thinking of how I got there and show examples of how to use it, and so forth.

Okay, now, here are the first few predictions we made, and some of the time we're predicting negative numbers for survival, and numbers greater than one, which doesn't really make much sense: people either survived (one) or they didn't (zero). It would be nice if we had a way to automatically squish everything between zero and one. That's going to make it much easier to optimise: the optimiser doesn't have to try hard to hit exactly one or exactly zero, it can just try to create a really big number to mean "survived" or a really small number to mean "perished".

Here's a great function. As I increase the range, you can see that as the input gets beyond four or five, it asymptotes to one, and on the negative side, as it gets beyond negative four or five, it asymptotes to zero; and if we zoom in a bit, around about zero it's pretty much a straight line. This is actually perfect; this is exactly what we want. Here is the equation, 1 / (1 + e^(-x)), and this is called the sigmoid function. By the way, if you haven't checked out SymPy before, definitely do so: it's the symbolic Python package, which can do Mathematica- or Wolfram-style symbolic calculations, including the ability to plot symbolic expressions, which is pretty nice. PyTorch already has a sigmoid function; it just calculates this, but in a more optimised way. So what if we replaced calc_preds? Remember, before, calc_preds was just this. What if we took that and then put it through a sigmoid?
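Here's a sketch of the accuracy metric, and of calc_preds with the sigmoid added at the end:

```python
def acc(coeffs):
    preds = calc_preds(coeffs, val_indep)
    # predictions over 0.5 count as "survived"; compare with the actuals and average
    return (val_dep.bool() == (preds > 0.5)).float().mean()

# redefine calc_preds: the same linear model, but squashed into (0, 1) by the sigmoid
def calc_preds(coeffs, indeps):
    return torch.sigmoid((indeps * coeffs).sum(axis=1))
```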
So with calc_preds now, basically, the bigger this number is, the closer it gets to one, and the smaller it is, the closer it gets to zero. That should be a much easier thing to optimise, and it ensures all of our values are in a sensible range.

Now, here's another cool thing about using Jupyter plus Python: Python is a dynamic language. Even though train_model calls one_epoch, which calls calc_loss, which calls calc_preds, I can redefine calc_preds now and I don't have to do anything else; it's now inserted into Python's symbol table, and that's the calc_preds that train_model will eventually call. So if I now call train_model, it's actually going to call my new version of calc_preds. That's a really neat way of doing exploratory programming in Python. I wouldn't release a library that redefines calc_preds multiple times; when I'm done, I'd just keep the final version, of course. But it's a great way to try things, as you'll see.

And look what's happened: I found I was able to increase the learning rate from 0.1 to 2 (yes, it was much easier to optimise, as I guessed), the loss has improved from 0.295 to 0.197, and the accuracy has improved from 0.79 to 0.82, nearly 0.83.

As a rule, this is something we're pretty much always going to do when we have a binary dependent variable, that is, a dependent variable that's one or zero: the very last step is to chuck it through a sigmoid. Generally speaking, if you're wondering "why is my model with a binary dependent variable not training very well?", this is the thing to check: are you chucking it through a sigmoid, or is the thing you're calling chucking it through a sigmoid, or not? It can be surprisingly hard to find out whether that's happening. For example, with Hugging Face Transformers I actually found I had to look in their source code to find out, and I discovered that something I was doing wasn't going through one, and it didn't seem to be documented anywhere. But it is important to find these things out.

As we'll discuss in the next lesson, we'll talk a lot about neural net architecture details, but the details we'll focus on are what happens to the inputs at the very first stage and what happens to the outputs at the very last stage. We'll talk a bit about what happens in the middle, but a lot less. The reason why is that the things you do to the inputs have to change for every single dataset you use, and what you want to happen to the outputs changes for every different target you're trying to hit; those are the things you actually need to know about. For example, you need to know about the sigmoid function, and you need to know that you need to use it. fastai is very good at handling this for you, which is why we haven't had to talk about it much until now: if you say it's a CategoryBlock dependent variable, it's going to use the right kind of thing for you. But most libraries are not so convenient.

John: A question? Yes, it's back on the feature engineering topic, but a couple of people have liked it, so I thought we'd put it out there. Shivan says: one concern I have while using get_dummies (so it's at that get_dummies phase) is what happens when, in the test data, I have a new category. Let's say male, female and other: this will add an extra column that was missing from the training data. How do you take care of that?

That's a great question.
Yeah, so normally you've got to think about this pretty carefully, and check pretty carefully, unless you use fastai. fastai always creates an extra category called "other", and at test time (inference time), if you have some level that didn't exist before, we put it into that "other" category for you. Otherwise, you basically have to do that yourself, or at least check. Generally speaking, it's pretty likely that otherwise your extra level will be silently ignored: it's going to be in the dataset, but it's not going to be matched to a column. So it's a good point, and definitely worth checking. For categorical variables with lots of levels, I actually normally like to put the less common ones into an "other" category as well, and again that's something fastai will do for you automatically. But yes, definitely something to keep an eye out for. Good question.

Okay, so before we take our break we'll do one last thing, which is submit this to Kaggle, because I think it's quite cool that we've successfully built a model from scratch. Kaggle provides us with a test.csv, which has exactly the same structure as the training CSV except that it doesn't have a Survived column. Now, interestingly, when I tried to submit to Kaggle I got an error in my code saying that one of my fares was empty. That was interesting, because the training set doesn't have any empty fares. Sometimes this will happen: the training set and the test set have different things you need to deal with. In this case I just said, oh, there's only one row, I don't care, and replaced the empty fare with a zero. Then I copied and pasted the pre-processing steps from my training DataFrame and applied them to the test DataFrame, along with the normalisation.

And so now I just call calc_preds, ask whether it's greater than 0.5, and turn that into a zero or one, because that's what Kaggle expects, and put it into the Survived column, which, remember, previously didn't exist. Then, finally, I created a DataFrame with just the two columns, PassengerId and Survived, and stuck it in a CSV file. I can call the Unix command `head` just to look at the first few rows, and if you look at the Kaggle competition's data page, you'll see this is what the submission file is expected to look like, so that made me feel good, and I went ahead and submitted it. I remember I was basically right in the middle, about 50 percent: better than half the people who have entered the competition, worse than half. So, a solid middle-of-the-pack result for a linear model from scratch; I think that's a pretty good place to start. So let's take a ten-minute break; we'll come back at 7:17 and continue on our journey.

All right, welcome back. You might remember from Excel that after we did the SUMPRODUCT version, we then replaced it with a matrix multiply. Wait, not there, must be here... here we are, the matrix multiply. So let's do that step now. Matrix times vector, then `.sum(axis=1)`, is the same thing as a matrix multiply, so here is the times-then-sum version. Now, we can't use the `*` character for a matrix multiply, because it means an element-wise operation; all of times, plus, minus, divide in PyTorch and NumPy mean element-wise, corresponding elements. So in Python we instead use the `@` character. As far as I know that's a pretty arbitrary choice; it's just one of the characters that wasn't already being used.
It's a bit unusual, but `@` is an official Python operator, and it means matrix multiply. Python itself doesn't come with an implementation of it, but because these are tensors, PyTorch's implementation gets used, and as you can see, the results are exactly the same. So we can now simplify what we had before a little bit: calc_preds is now torch.sigmoid of the matrix multiply.

Now, there's one thing I'd like to move towards: we're going to try to create a neural net in a moment, and that means that rather than treating this as a matrix times a vector, I want to treat it as a matrix times a matrix, because we're about to add some more columns of coefficients. So we're going to change init_coeffs so that, rather than creating an n_coeff-long vector, we create an n_coeff-by-1 matrix. In math we'd probably call that a column vector, but I think that's kind of a dumb name in some ways, because it's a matrix; it's a rank-two tensor. The matrix multiply will work fine either way, but the key difference is that if we do it this way, the result of the matrix multiply will also be a matrix: an n-rows-by-1 matrix. That means that when we compare it to the dependent variable, we need the dependent variable to be an n-rows-by-1 matrix as well. So effectively we need to take the n-rows-long vector and turn it into an n-rows-by-1 matrix.

There's some very useful, and at first maybe a bit weird, notation in PyTorch and NumPy for this. If I take my training dependent-variables vector and index into it, a colon means "every row"; in other words, that just means the whole vector, basically the same as the vector itself. Then I index into a second dimension. Now, this doesn't have a second dimension, but there's a special thing you can do: if you index into a second dimension with the special value None, it creates that dimension. So this has the effect of adding an extra trailing dimension, turning it from a vector into a matrix with one column. If we look at the shape after that, you can see it's now got what we call a trailing unit axis: 713 rows and one column. So now, if we train our model, we'll get coefficients just like before, except that it's now a column vector, also known as a rank-two matrix with a trailing unit axis. That hasn't changed anything; it's just repeated what we did in the previous section. But it has set us up to expand, because now that we've done this using a matrix multiply, we can go ahead and create a neural network.
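Here, roughly, is what those two changes look like in code (a sketch):

```python
# the @ operator is a true matrix multiply; for matrix-times-vector it matches times-then-sum
(t_indep * coeffs).sum(axis=1)
t_indep @ coeffs                        # same numbers

# make the coefficients an n_coeff-by-1 matrix, and the dependents matrices with one column
def init_coeffs(): return (torch.rand(n_coeff, 1) - 0.5).requires_grad_()

trn_dep = trn_dep[:, None]              # None adds a trailing unit axis: shape (713,) -> (713, 1)
val_dep = val_dep[:, None]

def calc_preds(coeffs, indeps): return torch.sigmoid(indeps @ coeffs)
```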
So, for our neural network, remember back to the Excel days: there, we didn't create a column vector of coefficients, we actually created a matrix with two sets of coefficients, so when we did our matrix multiply, every row gave us two sets of outputs, which we then put through ReLU (remember, we just used an IF statement) and added together. So for our coefficients now, to make a proper neural net, we need one set of coefficients for the first layer, and here they are: torch.rand of n_coeff by... how many? Well, in Excel we just did two, because I got bored of getting everything working properly, but in PyTorch you don't have to worry about filling right and creating columns and blah blah blah; you can create as many as you like. So I made it something you can change; I call it n_hidden, the number of hidden activations, and I just set it to 20. As before, we centre them by making them go from minus 0.5 to 0.5.

Now, when you do stuff by hand, everything gets more fiddly. If our coefficients are too big or too small, it's basically not going to train at all; the gradients still kind of vaguely point in the right direction, but you'll jump too far, or not far enough, or whatever. I want my gradients to be about the same as they were before, so I divide by n_hidden, because otherwise, at the next step, when the next matrix multiply adds things up, it's going to be much bigger than it was before. It's all very fiddly.

So that's going to give me, for every row, 20 activations, 20 values, just like in Excel we had two values because we had two sets of coefficients. To create a neural net, I now need to multiply each of those 20 things by a coefficient, and this time it's going to be a column vector, because I want to create one output: a predictor of survival. So again, torch.rand, and this time it's n_hidden coefficients by one. And again, finding something that actually trains properly required some fiddling around to figure out how much to subtract; I found that if I subtract 0.3, I could get it to train. Finally, I didn't need a constant term for the first layer, as we discussed, because our dummy variables have n columns rather than n-1 columns, but layer two absolutely needs a constant term. We could do that, as we discussed last time, by having a column of ones, although in practice I find it's easier just to create a separate constant term. So here is a single scalar random number. Those are the coefficients we need: one set to go from input to hidden, one to go from hidden to a single output, and a constant; and they're all going to need requires_grad.

So now we can change how we calculate predictions. We're going to pass in all of our coefficients, and a nice thing in Python is that if you've got a list or tuple of values, on the left-hand side you can expand them out into variables. This is going to be a list of three things, so we'll call them layer 1, layer 2, and the constant, because those are the three things we returned. (In Python, if you just chuck things with commas between them like this, it creates a tuple; a tuple is basically an immutable list.) So we grab those three things. Step one is to do our matrix multiply and, as we discussed, we then replace the negatives with zeros; then we put that through our second matrix multiply, our second layer, and add the constant term; and remember, of course, at the end, chuck it through a sigmoid. So here is a neural network.

Now, update_coeffs previously subtracted the gradients times the learning rate from the coefficients, but now we've got three sets of those, so we just have to chuck that in a for loop. So change that as well, and now we can go ahead and train our model... we just trained a neural net! And how does that compare?
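Pulling that together, a sketch of the one-hidden-layer version; the particular centring constants are exactly the fiddly bits mentioned above, found by trial and error.

```python
import torch.nn.functional as F

n_hidden = 20

def init_coeffs():
    layer1 = (torch.rand(n_coeff, n_hidden) - 0.5) / n_hidden   # input -> 20 hidden activations
    layer2 = torch.rand(n_hidden, 1) - 0.3                      # hidden -> 1 output
    const  = torch.rand(1)[0]                                   # a single scalar constant
    return layer1.requires_grad_(), layer2.requires_grad_(), const.requires_grad_()

def calc_preds(coeffs, indeps):
    l1, l2, const = coeffs
    res = F.relu(indeps @ l1)          # first layer, negatives replaced with zeros
    res = res @ l2 + const             # second layer plus the constant term
    return torch.sigmoid(res)          # sigmoid at the very end

def update_coeffs(coeffs, lr):
    for layer in coeffs:               # now three tensors to step, not one
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()
```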
So, the loss is a little better than before, and the accuracy is exactly the same as before. And I will say, it was very annoying to get to this point; trying to get these constants right and find a learning rate that worked was super fiddly. But we got there, and it's a very small dataset, so I don't know if this is necessarily better or worse than the linear model, but it's certainly fine. And I think it's pretty cool that we were able to build a neural net from scratch that's doing pretty well.

But I hear that all the cool kids nowadays are doing deep learning, not just neural nets, so we'd better make this deep learning. This one only has one hidden layer, so let's create one with n hidden layers. For example, let's say we want two hidden layers, with 10 activations in each; you can put as many as you like here. init_coeffs now has to create a torch.rand for every one of those hidden layers, and then another torch.rand for the constant terms, stick requires_grad_ on all of them, and return them. So that's how we can initialise as many layers of coefficients as we want. The sizes of each one: the first layer goes from n_coeff to 10, the second matrix from 10 to 10, and the third matrix from 10 to 1. It's worth working through these matrix multiplies on a spreadsheet or a piece of paper, to convince yourself that there's the right number of activations at each point.

Then we need to update calc_preds so that, rather than doing each of these steps manually, we loop through all the layers, do the matrix multiply, add the constant, and, as long as it's not the last layer, do the ReLU. Why not the last layer? Because, remember, the last layer gets the sigmoid. These questions about what happens on the last layer are important things you need to know about, and to check when things aren't working. The things here, torch.sigmoid and F.relu, are called the activation functions for these layers. One of the most common mistakes amongst people trying to create their own architectures, or variants of architectures, is to mess up their final activation function, and that makes things very hard to train. So make sure we've got torch.sigmoid at the end, and no ReLU at the end. So there's our deep-learning calc_preds.

And then just one last change: now, when we update our coefficients, we go through all the layers and all the constants. And again, there was so much messing around here, trying to find the exact ranges of random numbers that end up training; but eventually I found some, and as you can see it gets to about the same loss and about the same accuracy.

This code is worth spending time with. When code is inside a function it can be a little difficult to experiment with, so what I'd be inclined to do, to understand this code, is copy and paste the cell, make it so it's not in a function any more, and use Ctrl-Shift-Minus to separate it out into separate cells, and then try to set things up so you can run a single layer at a time, or a single coefficient, and make sure you can see what's going on. That's why we use notebooks: so that we can experiment. And it's only through experimenting like that that, at least for me, I find I can really understand what's going on.
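And here's a sketch of the deep version, which you can pull apart cell by cell as just described; the particular scaling and centring constants are only ones that happened to train OK.

```python
hiddens = [10, 10]     # two hidden layers; add more numbers here for a deeper net

def init_coeffs():
    sizes = [n_coeff] + hiddens + [1]          # e.g. 12 -> 10 -> 10 -> 1
    n = len(sizes)
    layers = [(torch.rand(sizes[i], sizes[i+1]) - 0.3) / sizes[i+1] * 4 for i in range(n-1)]
    consts = [(torch.rand(1)[0] - 0.5) * 0.1 for i in range(n-1)]
    for l in layers + consts: l.requires_grad_()
    return layers, consts

def calc_preds(coeffs, indeps):
    layers, consts = coeffs
    res = indeps
    for i, l in enumerate(layers):
        res = res @ l + consts[i]
        if i != len(layers) - 1: res = F.relu(res)   # ReLU everywhere except the last layer
    return torch.sigmoid(res)                        # sigmoid only at the very end

def update_coeffs(coeffs, lr):
    layers, consts = coeffs
    for layer in layers + consts:
        layer.sub_(layer.grad * lr)
        layer.grad.zero_()
```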
Nobody can look at this code and immediately say "ah, I get it, that all makes perfect sense"; I don't think anybody can. But once you try running through it yourself, you'll be like, oh, I see why that's as it is.

So, one thing to point out here is that our neural nets and deep learning models don't particularly seem to help. Does that mean that deep learning is a waste of time, and you've just done five lessons that you shouldn't have done? No, not necessarily. This is a playground competition; we're doing it because it's easy to get your head around. But for very small datasets like this, with very few columns where the columns are really simple, deep learning is not necessarily going to give you the best result. In fact, as I mentioned, nothing we do today is going to be as good as a carefully designed model that uses just the Name column. I think that's an interesting insight: for the kinds of data that have a very consistent structure, for example images or natural-language text documents, quite often you can somewhat brainlessly chuck a deep learning neural net at the problem and get a great result. Generally, for tabular data, I find that's not the case: I normally have to think pretty long and hard about the feature engineering in order to get good results. But once you've got good features, you then want a good model, and generally, the more features you have and the more levels in your categorical features and so forth, the more value you'll get from more sophisticated models. So yes, I'd definitely say an insight here is that you want to include simple baselines as well, and we're going to see even more of that in a couple of notebooks' time.

So we've just seen how you can build this stuff from scratch; we'll now see why you shouldn't. I mean, I say you shouldn't: you should, to learn, but you probably won't want to in real life. When you're doing stuff in real life, you don't want to be fiddling around with all this annoying initialisation stuff and learning-rate stuff and dummy-variable stuff and normalisation stuff and so forth, because we can do it for you. And it's not that everything's so automated that you don't get to make choices; rather, you want to make a choice only where you're not doing things the obvious way, and have everything else done the obvious way for you. So that's why we're going to look at the "Why you should use a framework" notebook, and again I'm going to look at the clean version of it.

Step one, again, is to download the data as appropriate for the Kaggle or non-Kaggle environment, set the display options, set the random seed, and read the data into a DataFrame. Now, there was so much fussing around with the doing-it-from-scratch version that I didn't want to do any feature engineering, because every column I added was another thing I had to think about: dummy variables and normalisation and random coefficient initialisation and blah blah blah. But with a framework, everything's so easy that you can do all the feature engineering you want. Because this isn't a lesson about feature engineering, I plagiarised it entirely from this fantastic advanced feature-engineering tutorial on Kaggle.
If you want to learn a bit of pandas, these are some great lines of code to step through one by one. Again, take them out of the function, put them into individual cells, run each one, and look up the tutorials: what does str do, what does map do, what do groupby and transform do, what does value_counts do? Part of the reason I put this here was so that folks who haven't done much, if any, pandas have some examples of functions that I think are useful, and I actually refactored this code quite a bit to try to show off some features of pandas that I think are really nice.

So we'll do the same random split as before, passing in the same seed. And now we're going to do the same set of steps that we did manually, but with fastai. We want to create a tabular dataset based on a pandas data frame. Here is the data frame; these are the train versus validation splits I want to use; and here's a list of all the stuff I want done, please: deal with dummy variables for me, deal with missing values for me, normalize continuous variables for me. I'm going to tell you which ones are the categorical variables; for example, Pclass was a number, but I'm telling fastai to treat it as categorical. Here are all the continuous variables, here's my dependent variable, and the dependent variable is a category. So create data loaders from that, please, and save models right here in this directory. That's it. That's all the preprocessing I need to do, even with all those extra engineered features.

Then create a learner. This, remember, is something that contains a model and data, and I want you to put in two hidden layers with 10 units and 10 units, just like we did in our final example.

What learning rate should I use? Make a suggestion for me, please: call lr_find. You can use this for any fastai model. What this does is start at a learning rate that's very, very small, 10 to the negative 7; it puts through one batch of data and calculates the loss; then it increases the learning rate slightly and puts through another batch of data; and it keeps doing that for higher and higher learning rates, keeping track of the loss as it goes, just one batch of data at a time. What happens is that for the very small learning rates, nothing happens; but then once you get high enough, the loss starts improving, and as it gets higher it improves faster, until you make the learning rate so big that it overshoots and kills the training. So generally somewhere around there is the learning rate you want. fastai has a few different ways of recommending a learning rate; you can look up the docs to see what they mean. I generally find that if you choose "slide" and "valley" and pick a value between the two, you get a pretty good learning rate. Here we've got about 0.01 and about 0.08, so I picked 0.03, and then just ran a bunch of epochs.
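Roughly, those framework steps look like the following sketch. It assumes the engineered column names from the earlier sketch; the exact feature lists, the batch handling, and the epoch count (16) are illustrative rather than the notebook's verbatim code:

```python
from fastai.tabular.all import *

# df is the engineered training data frame; use the same seed as before
splits = RandomSplitter(seed=42)(df)

dls = TabularPandas(
    df, splits=splits,
    procs=[Categorify, FillMissing, Normalize],   # dummies, missing values, normalization
    cat_names=["Sex", "Pclass", "Embarked", "Deck", "Title"],   # treat Pclass as categorical
    cont_names=["Age", "SibSp", "Parch", "LogFare", "Alone", "TicketFreq", "Family"],
    y_names="Survived", y_block=CategoryBlock(),
).dataloaders(path=".")   # "." so models get saved right here

# Two hidden layers of 10 units each, just like the from-scratch version
learn = tabular_learner(dls, metrics=accuracy, layers=[10, 10])

# Try increasing learning rates one batch at a time and plot the loss
learn.lr_find(suggest_funcs=(slide, valley))

# Pick a value between the two suggestions and train for a bunch of epochs
learn.fit(16, lr=0.03)
```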
And this is a bit crazy: after all that, we've ended up with exactly the same accuracy as the last two models. That's just a coincidence; I mean, there's nothing particularly special about that accuracy. So at this point we can now submit that to Kaggle.

Now remember, with the linear model we had to repeat all of the preprocessing steps on the test set in exactly the same way. You don't have to worry about that with fastai. I mean, we still have to fill in the missing fare and add our feature-engineered columns, but for all the preprocessing we just have to use this one function called test_dl. It says: create a data loader that contains exactly the same preprocessing steps that our learner used. And that's it; that's all you need. You want to make sure that your inference-time transformations and preprocessing are exactly the same as at training time, and this is the magic method which does that, in just one line of code. Then to get your predictions, you just call get_preds and pass in that data loader I just built. These three lines of code are the same as in the previous notebook, and we can take a look at the top of the result and, there it is.

So how did that go? I don't remember. Oh, I didn't say; I think it was again basically middle of the pack, if I remember correctly.

One of the nice things about it now being so easy to add features and build models is that we can experiment with things much more quickly. So I'm going to show you how easy it is to experiment with what's often considered a fairly advanced idea, which is called ensembling. There are lots of ways of doing ensembling, but basically ensembling is about creating multiple models and combining their predictions. The easiest kind of ensemble is to literally build several copies of the same model, so that each one has a different set of randomly initialized coefficients and therefore ends up with a different set of predictions. So I just create a function called ensemble which creates a learner exactly the same as before, fits exactly the same as before, and returns the predictions. Then we use a list comprehension to do that five times, which gives us a set of five predictions.

Now we can take all those predictions, stack them together, and take the mean over, sorry, not the rows, the first dimension: the mean over the set of predictions. That gives us the average prediction of our five models. We can turn that into a CSV and submit it to Kaggle. And that one, I think, went a bit better. Let's check. Yeah, okay, that one actually finally gets into the top 25 percent of the competition. Not amazing by any means, but you can see that this simple step of creating five independently trained models, just starting from different starting points in terms of random coefficients, actually improved us from top 50 percent to top 25 percent.
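A minimal sketch of the inference and ensembling steps, continuing from the previous sketch (tst_df is the engineered test data frame; the use of no_bar/no_logging to quieten training output and the submission column handling are illustrative):

```python
# Build a test DataLoader that applies exactly the same preprocessing as training
tst_dl = learn.dls.test_dl(tst_df)
preds, _ = learn.get_preds(dl=tst_dl)

def ensemble():
    # Same architecture, same training; only the random initial coefficients differ
    learn = tabular_learner(dls, metrics=accuracy, layers=[10, 10])
    with learn.no_bar(), learn.no_logging():
        learn.fit(16, lr=0.03)
    return learn.get_preds(dl=tst_dl)[0]

# Five independently trained models, averaged over the first dimension
all_preds = [ensemble() for _ in range(5)]
ens_preds = torch.stack(all_preds).mean(0)

# Threshold the averaged survival probability and write a submission file
tst_df['Survived'] = (ens_preds[:, 1] > 0.5).int()
tst_df[['PassengerId', 'Survived']].to_csv('ens_sub.csv', index=False)
```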
John: Is there an argument, because you've got a categorical result, effectively zero or one, that you might use the mode of the ensemble rather than the numerical mean?

I mean, yes, there's an argument that's been made, and it's something I would just try. I generally find it's less good, but not always, and I don't feel like I've got a great intuition as to why, or that I've seen any studies of why. There are at least three things you could do: you could take the "is it greater or less than 0.5" ones and zeros and average them; you could take the mode of them; or you could take the actual probability predictions, average those, and then threshold the result. I've seen examples where each of the two averaging versions has been better; I don't think I've seen one where the mode was better, but that approach was very popular back in the 90s. It'd be so easy to try that you may as well give it a go.

Okay, we don't have time to finish the next notebook, but let's make a start on it. The next notebook is "How random forests really work". Who here has heard of random forests before? Nearly everybody, okay. They're very popular; developed, I think, initially in 1999, and they gradually grew in popularity during the 2000s. Everybody kind of knew me as Mr. Random Forests for years; I implemented them a couple of days after the original technical report came out, I was such a fan. All of my early Kaggle results were built around random forests. I love them, and I think hopefully you'll see why I'm such a fan, because they're so elegant and they're almost impossible to mess up.

A lot of people will say, oh, why are you using machine learning, why don't you use something simple like logistic regression? And I think, oh gosh, in industry I've seen far more examples of people screwing up logistic regression than successfully using it, because it's very, very difficult to do correctly. You've got to make sure you've got the correct transformations and the correct interactions and the correct outlier handling and blah blah blah, and anything you get wrong, the entire thing falls apart. With random forests, it's very rare that I've seen somebody screw one up in industry. They're very hard to screw up, because they're so resilient, and we'll see why.

So in this notebook, just by the way, rather than importing numpy and pandas and matplotlib and blah blah blah, there's a handy little shortcut, which is that you can just import everything from fastai.imports. That imports all the things you normally want. It doesn't do anything special, but it saves some messing around. Again, we've got our cell here to grab the data, and I'm just going to do some basic preprocessing: the fillna for fare is only needed for the test set, of course; grab the modes and do the fillna using the modes; take the log fare; and then I've got a couple of new steps here, which convert Embarked and Sex into categorical variables. What does that mean? Roughly, the preprocessing looks like the sketch below.
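Here is a minimal sketch of that preprocessing, assuming the Titanic column names; the function name proc_data and the exact column lists are illustrative:

```python
from fastai.imports import *   # handy shortcut: numpy, pandas, matplotlib, etc.

def proc_data(df):
    # Fill the one missing test-set fare, fill other gaps with the training modes,
    # take the log of fare, and store Embarked and Sex as pandas categoricals
    df['Fare'] = df.Fare.fillna(0)
    df.fillna(modes, inplace=True)
    df['LogFare'] = np.log1p(df['Fare'])
    df['Embarked'] = pd.Categorical(df.Embarked)
    df['Sex'] = pd.Categorical(df.Sex)

modes = df.mode().iloc[0]   # most common value of each column
proc_data(df)
proc_data(tst_df)

# Which columns we'll treat as categorical vs continuous, and the dependent variable
cats = ["Sex", "Embarked"]
conts = ["Age", "SibSp", "Parch", "LogFare", "Pclass"]
dep = "Survived"
```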
If we run this on both the training data frame and the test data frame, and split the columns into categorical and continuous, then Sex is a categorical variable, so let's look at it. Well, that's interesting: it looks exactly the same as before, male and female, but now it's got a dtype of category, and it's got a list of categories. What's happened here? What's happened is that pandas has made a list of all of the unique values of this field, and behind the scenes, if you look at the cat codes, you can see it has actually turned them into numbers. It looks up this 1 in the list to get "male", and looks up this 0 in the list to get "female". So when you print it out, it prints the friendly version, but it stores the values as numbers.

You'll see in a moment why this is helpful, but a key thing to point out is that we're not going to have to create any dummy variables; and even Pclass, first, second or third class, we're not going to treat as categorical at all. You'll see why in a moment.

A random forest is an ensemble of trees. A tree is an ensemble of binary splits. So we're going to work from the bottom up: we're first going to learn what a binary split is, and we're going to do it by looking at an example. Consider what would happen if we took all the passengers on the Titanic and grouped them into males and females, and let's look at two things. The first is their survival rate: about 20 percent survival for males and about 75 percent for females. The second is the histogram: how many of them are there? About twice as many males as females.

Now consider what would happen if you created the world's simplest model, which was just: what sex are they? It wouldn't be bad, would it? Because there's a big difference between the males and the females, a huge difference in survival rate. So if we said, if you're a male you probably died, and if you're a female you probably survived (not just men and women, but males and females, so boys and girls too), that would be a pretty good model, because it does a good job of splitting the rows into two groups that have very different survival rates. This is called a binary split: a binary split is something that splits the rows into two groups, hence "binary".

I was about to talk about another example of a binary split, but I'm getting ahead of myself. Before we do that, let's look at what would happen if we used this model. If we created a model which just looked at sex, how good would it be? To figure that out, we first have to split into training and validation sets, so let's go ahead and do that. Then let's convert all of our categorical variables into their codes, so we've now got 0, 1, 2, whatever; we don't have "male" and "female" there anymore. Let's also create something that returns the independent variables, which we'll call the xs, and the dependent variable, which we'll call y. So we can now get the xs and the y for each of the training set and the validation set.

Now let's create some predictions: we'll predict that they survived if their sex is 0, so if they're female. How good is that model? Remember I told you that to calculate mean absolute error we can get scikit-learn or PyTorch or whatever to do it for us, instead of doing it ourselves. So, just to show you, here's how you do it by importing it directly; this is exactly the same as the one we did manually in the last notebook. That gives a 21.5 percent error, so that's a pretty good model.
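A minimal sketch of that baseline, continuing from the preprocessing sketch above (the helper name xs_y and the in-function copy are illustrative; the 0.215 in the comment is the error rate quoted in the lesson):

```python
from numpy import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

random.seed(42)
trn_df, val_df = train_test_split(df, test_size=0.25)

def xs_y(df):
    # Replace categorical columns with their integer codes, then return the
    # independent variables (xs) and the dependent variable (y)
    df = df.copy()
    df[cats] = df[cats].apply(lambda x: x.cat.codes)
    xs = df[cats + conts]
    return xs, (df[dep] if dep in df else None)

trn_xs, trn_y = xs_y(trn_df)
val_xs, val_y = xs_y(val_df)

# Predict "survived" for females (code 0) and "died" for males
preds = val_xs.Sex == 0
mean_absolute_error(val_y, preds)   # roughly 0.215 in the lesson
```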
Could we do better? Well, here's another example: what about fare? Fare is different to sex because fare is continuous (or log fare, which I'll use), but we could still split it into two groups. So here, for all the people that didn't survive, this is their median fare, and these are the quartiles for bigger and smaller fares; and here's the median fare for those that did survive, and their quartiles. You can see the median fare for those that survived is higher than the median fare for those that didn't.

We can't create a histogram exactly for fare, because it's continuous. Actually, that's not quite true: we could bucket it into groups to create a histogram. What I should say is that we can create something better, which is a kernel density plot; it's just like a histogram, but with infinitely small bins. We can see most people have a log fare of about two. So what if we split at a bit under three? That seems to be a point at which there's a difference in survival between people greater than or less than that amount. So here's another model: log fare greater than 2.7. Oh, much worse: 0.336 versus 0.215. Well, I don't know, maybe there's something better.

We could create a little interactive tool. What I want is something that can give us a quick score of how good a binary split is, and I want it to work regardless of whether we're dealing with categorical or continuous or whatever data. So I just came up with a simple little way of scoring, which is: if you split your data into two groups, a good split would be one in which all of the values of the dependent variable on one side are pretty much the same, and all of the values of the dependent variable on the other side are pretty much the same. For example, if pretty much all the males had the same survival outcome, which is that they didn't survive, and all the females had about the same survival outcome, which is that they did survive, that would be a good split. And it doesn't just work for categorical variables; it would work if your dependent variable was continuous as well. You basically want each group to be as similar as possible, within the group, on the dependent variable, and the other group you also want to be as similar as possible on the dependent variable.

So how similar are all the things in a group? That's the standard deviation. What I want to do is basically add up the standard deviations of the dependent variable within the two groups. And if there's a really small standard deviation but it's also a really small group, that's not very interesting, so I'll multiply it by the size of the group. So this is something which says: what's the score for one of my sides? It's the standard deviation multiplied by how many things are in that group. The total score is then the score for the left-hand side, all the things in one group, plus the score for the right-hand side (tilde means "not", so "not left-hand side" is the right-hand side), and then we just take the average of that.

So, for example, if we split by sex, greater than or less than 0.5, that creates two groups, males and females, and that gives us this score. And if we split log fare at greater than or less than 2.7, that gives us this score; lower scores are better, so sex is better than log fare.
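A minimal sketch of that scoring idea, continuing from the earlier sketches, together with the automatic search over split thresholds that comes up next; the helper names _side_score, score and min_col are illustrative, and the numbers in the comments are the values quoted in the lesson:

```python
def _side_score(side, y):
    # How "impure" one side of the split is: the standard deviation of the
    # dependent variable within the group, weighted by the group's size
    tot = side.sum()
    if tot <= 1:
        return 0
    return y[side].std() * tot

def score(col, y, split):
    # Score a binary split of `col` at threshold `split`; lower is better
    lhs = col <= split
    return (_side_score(lhs, y) + _side_score(~lhs, y)) / len(y)

def min_col(df, nm):
    # Try every unique value of a column as a threshold and keep the best one
    col, y = df[nm], df[dep]
    unq = col.dropna().unique()
    scores = np.array([score(col, y, o) for o in unq if not np.isnan(o)])
    idx = scores.argmin()
    return unq[idx], scores[idx]

score(trn_xs["Sex"], trn_y, 0.5)       # about 0.407 in the lesson
score(trn_xs["LogFare"], trn_y, 2.7)   # a higher (worse) score than Sex
min_col(trn_df, "Age")                 # best Age threshold: 6, score about 0.478
```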
So now that we've got that, we can use our favorite interact tool to create a little GUI, and we can say, let's try this one. Can we find something that's a bit better? No, not very good. What about Pclass? 0.468... 0.460... so we can fiddle around with these. We can do the same thing for the categorical variables; we already know that with sex we can get to 0.407. What about embarked? Hmm. All right, so it looks like sex might be our best.

Well, that was pretty inefficient, right? It would be nice if we could find some automatic way to do all that. And of course we can. For example, if we wanted to find the best split point for age, we just have to create a list of all of the unique values of age and try each one in turn, and see what score we get if we made a binary split at that level of age. So here's a list of all of the possible binary split thresholds for age. We go through all of them, calculate the score for each, and then NumPy and PyTorch have an argmin function which tells you which index into that list is the smallest. So, just to show you, here are the scores, and counting zero, one, two, three, four, five, six: apparently that value has the smallest score. That tells us that for age, a threshold of six would be best.

So here's something that just calculates that for a column: it calculates the best split point. Here's the six, right, and it also tells us what the score is at that point, which is 0.478. Now we can just go through and calculate the score at the best split point for each column, and if we do that, we find that the lowest score is for sex.

So that is how we calculate the best binary split. We now know that the model we created earlier, this one, is the best single binary split model we can find. Next week we're going to learn how we can do this recursively to create a decision tree, and then do that multiple times to create a random forest. But before we do, I want to point something out: this ridiculously simple thing, find a single binary split and stop, is a type of model with a name. It's called OneR. And the OneR model, it turned out in a review of machine learning methods in the 90s, was one of the best, if not the best, machine learning classifiers across a wide range of real-world datasets. So that is to say: don't assume that you have to go complicated. It's not a bad idea to always start by creating a OneR baseline, a decision tree with a single binary split. In fact, for the Titanic competition, that's exactly what we do: if you look at the Titanic competition on Kaggle, you'll find that our sample submission is one that just splits into male versus female.

All right. Thanks everybody. I hope you found that interesting, and I will see you next lesson. Bye.