Okay, so we're here at the Melbourne R Meetup, and we're talking about some techniques that Jeremy Howard has used to do as well as he can in a variety of Kaggle competitions. And we're going to start by having a look at some of the tools that I've found useful in predictive modeling in general and in Kaggle competitions in particular. So I've tried to write down here what I think are some of the key steps. So after you download data from a Kaggle competition, you end up with CSV files, generally speaking, which can be in all kinds of formats. So here's the first thing you see when you open up the time series CSV file. It's not very helpful, is it? Each of these columns is actually, oh here we come, quarterly time series data. And because, well, for various reasons, each one's a different length and they start further along, the particular way this was provided didn't really suit the tools I was using. In fact, if I remember correctly, I've already adjusted it slightly, because it originally came in rows rather than columns. Yeah, that's right, this is how it originally came: in rows rather than columns. So this is where this kind of data manipulation toolbox comes in. There are all kinds of ways to swap rows and columns around, which is where I started. The really simple approach is to select the whole lot, copy it, and then paste it and say transpose in Excel, and that's one way to do it. And having done that, I end up with something I can open up in, let's have a look at the original file. This is the original file in Vim, which is my text editor of choice. This is actually a really good time to get rid of all those leading commas, because they kind of confuse me. So this is where stuff like Vim is great, and even things like Notepad++ and Emacs and any of these power-user text editors will work fine.
As long as you know how to use regular expressions. And if you don't, I'm not going to show you now, but you should definitely look them up. So in this case, I'm just going to go, okay, let's use a regular expression. So I say substitute on the whole file: start with any number of commas, and replace it with nothing at all. And you can see, Vim and I are done. All right, so I can now save that, and I've got a nice easy file that I can start using. So that's why I've listed this idea of data manipulation tools in my toolbox. And to me, Vim or some regular-expression-capable power text editor which can handle large files is something to be familiar with. So just in case you didn't catch that, that is regular expressions. Probably the most powerful tool for doing text and data manipulation that I know of; sometimes they're just called regexes. The most powerful flavor of regular expressions, I would say, would be the one in Perl. They've been widely adopted elsewhere: any C program that uses the PCRE engine has the same regular expressions as Perl, more or less, and C# and .NET have the same regular expressions as Perl, more or less. So this is a nice example of one bunch of people getting it right and everybody else plagiarizing. Vim's regular expressions are slightly different, unfortunately, which annoys me no end, but they still do the job. So yeah, make sure you've got a good text editor that you know well how to use. Something with a good macro facility is nice as well. Again, Vim's great for that: you can record a series of keystrokes, hit a button, and it repeats them, basically, on every line. I also wrote Perl here because, to me, Perl is a rather unloved programming language. But if you think back to where it comes from, it was originally developed as the Swiss Army chainsaw of text processing tools. And that is something it still does, I think, better than any other tool. It has amazing command line options.
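The exact keystrokes aren't visible on the recording, but the leading-comma cleanup he describes is a one-line substitution. Here's a minimal Python sketch of the same thing (the sample data is made up):

```python
import re

def strip_leading_commas(text):
    """Remove any run of commas at the start of each line, the same
    cleanup described above for the transposed time series file."""
    return re.sub(r"^,+", "", text, flags=re.MULTILINE)

print(strip_leading_commas(",,,1.2,3.4\n,,5.6,7.8\n"))
```

In Vim the equivalent is a whole-file substitute of one-or-more commas at line start with nothing.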
You can pass it options that do things like run the following command on every line in a file, or run it on every line and print the result. There's a command line option to back up each file before changing it in place. I find with Perl I can do stuff that would take me much, much longer in any other tool. Even simple little things, like hacking on some data on the weekend where I had to concatenate a whole bunch of CSV files, but only wanted to keep the header line from the first one; every other file had a header line I had to delete. So in Perl, in fact, it's probably still going to be sitting here in my history, yeah. So in Perl, that's basically: -n means do this on every single row, and -e means I'm not even going to write a script file, I'm going to give you the thing to do right here on the command line. And here's a piece of rather difficult to comprehend Perl, but trust me, what it says is: if the line number is greater than one, then print that line. So here's something to strip the first line from every file. So this kind of stuff you can do in Perl is great. And I see a lot of people on the forums who complain that the format of the data wasn't quite what they expected, or not quite convenient: can you please change it for me? And I always think, well, this is part of data science. This is part of data hacking. This is data munging, or data manipulation. There's actually a really great book (I don't know if it's hard to find nowadays, but I loved it) called Data Munging with Perl. And it's a whole book about all the cool stuff you can do in Perl in a line or two. So okay, I've now got the data into a form where I can load it up into some tool and start looking at it. What's the tool I normally start with? I normally start with Excel. Now, your first reaction might be to think: Excel, not so good for big files.
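The concatenation job he describes, gluing CSV files together while keeping only the first file's header, can be sketched in Python like this (the file-handling details are my own, not the talk's actual code):

```python
def concat_csvs(paths, out_path):
    """Concatenate CSV files, keeping the header line from only the
    first file, the same job as the Perl one-liner described above."""
    with open(out_path, "w") as out:
        for i, path in enumerate(paths):
            with open(path) as f:
                lines = f.readlines()
            # Drop the header on every file except the first.
            out.writelines(lines if i == 0 else lines[1:])
```

The Perl version he ran is presumably something like `perl -i.bak -ne 'print if $. > 1'` over every file after the first: -n loops over lines, -e gives the code inline, -i.bak backs each file up before changing it in place.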
To which my reaction would be: if you're just looking at the data for the first time, why are you looking at a big file? Start by sampling it. And again, this is the kind of thing you can do with your data manipulation tools, like that thing I just showed you in Perl. If that says print if rand() is greater than 0.9, that's going to sample about one row in ten. So if you've got a huge data file, get it down to a size that you can easily start playing with, which normally means random sampling. So I like to look at it in Excel, and I will show you for a particular competition how I go about doing that. So let's have a look, for example, at a couple. So here's one which the New South Wales government recently ran, which was to predict how long it's going to take cars to travel along each segment of the M4 motorway in each direction. The data for this has a lot of columns, because every column is another route, and lots and lots of rows, because every row is another two-minutely observation, and it's very hard to get a feel for what's going on. There were various terrific attempts on the forum at trying to create animated pictures of what the road looks like over time. I did something extremely low-tech, which is something I'm proud of spending a lot of time doing: I created a simple little macro in Excel which selected each column and then went conditional formatting, color scales, red to green. And I ran that on each column, and I got this picture. So here's each route on this road, and here's how long it took to travel that route at this time. And isn't this interesting, because I can immediately see what traffic jams look like. See how they flow as you start getting a traffic jam here? They flow along the road as time goes on. And you can then start to see at what kinds of times they happen and where they tend to start. So here's a really big jam. And it's interesting, isn't it?
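The rand()-based sampling he describes translates directly to Python; here's a small sketch (the keep fraction and seeding are my choices, not the talk's):

```python
import random

def sample_lines(in_path, out_path, keep=0.1, seed=None):
    """Write roughly a `keep` fraction of the input lines to out_path,
    the same idea as printing a line only when rand() falls in range."""
    rng = random.Random(seed)
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if rng.random() < keep:
                dst.write(line)
```

With keep=0.1 this keeps about one row in ten, which is usually enough to open the result comfortably in Excel.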
So if we go into Sydney in the afternoon, then obviously you start getting these jams up here. And during the afternoon, you can see the jam moving, so that at 5 PM there are actually a couple of them. And at the other end of the road, it stays jammed until everybody's cleared up through the freeway. So you get a real feel for it. And even when it's not peak hour, and even in some of the areas which aren't so busy, you can see that's interesting: there are basically parts of the freeway which, outside peak hour, have basically constant travel time. And the colors are immediately showing me that. You see how easy it is. So when we actually got on the phone with the RTA to take them through the winning model (actually, the people that won this competition were kind enough to organize a screencast with all the people from the RTA and from Kaggle to show the winning model), the winners explained: what we did was basically create a model that, for a particular time and a particular route, looked at the times and routes just before it and around it on both sides. And I remember one of the RTA guys said, that's weird, because normally these kinds of queues, traffic jams, only go in one direction. So why would you look at both sides? And so I was able to quickly say, okay guys, that's true, but have a look at this. If you go to the other end, you can see how sometimes, although queues form in one direction, they can slide away in the other direction. So by looking at this kind of picture, you can see what your model is going to have to be able to model. You can see what kind of inputs it's going to have and how it's going to have to be set up.
And you can immediately see that if you create a model that basically tries to predict each cell based on the previous few periods of the routes around it, whatever modeling technique you're using, you're probably going to get a pretty good answer. And interestingly, the guys that won this competition, this is basically all they did: a really nice, simple model. They used random forests, as it happens, which we'll talk about soon. They added a couple of extra things, which I think included the rate of change of travel time, but that was basically it. So it's a really good example of how visualization can quite quickly tell you what you need to do. I'll show you another example. This is a recent competition that was set up by the dataists.com blog. What they wanted was to try and create a recommendation system for R packages. So they got a bunch of users to say: this user doesn't have this package installed; this user does have this package installed. So you can kind of see how this is structured, right? They added a bunch of additional potential predictors for you. How many dependencies does this package have? How many suggestions? How many imports? How many of the task views on CRAN is it included in? Is it a core package? Is it a recommended package? Who maintains it? And so forth. So I found it not particularly easy to get my head around what this looks like. So I used my number one most favorite tool for data visualization and ad hoc analysis, which is a pivot table. A pivot table is something which dynamically lets you slice and dice your data. If you've used maybe Tableau or something like that, you'll know the feel. This is kind of like Tableau, except it doesn't cost $1,000. No, I mean, Tableau's got cool stuff as well. But this is fantastic for most things I find I need to do.
And so in this case, I simply dragged user ID up to the top and dragged package name down the side, and just quickly threw this into a matrix, basically. And so you can see here what this data looks like, which is that those nasty people at dataists.com have deleted a whole bunch of cells in this matrix. So that's the stuff they want to predict. And then we can see that generally, as you'd expect, there are ones and zeros. There's some weird shit going on here where some people apparently have things installed twice, which suggests to me maybe there's something funny with the data collection. And there are other interesting things. There are some packages which seem to be quite widely installed. Most people don't install most packages. And there is this mysterious user number five, who is the world's biggest package slut. He or she installs everything they can, and I can only imagine that ADaCGH is particularly hard to install, because not even user number five managed to get around to it. So you can see how creating a simple little picture like this gives me a sense of what's going on. So I took the data from that package competition and I kind of thought: well, let's say this empty cell is the one we're trying to predict. If I just knew, in general, how commonly AcceptanceSampling was installed, and how often user number one installed stuff, I'd probably have a good sense of the probability of user number one installing AcceptanceSampling. So to me, one of the interesting points here was to think: actually, I don't think I care about any of this other stuff. So I jumped into R, and all I did was basically say: okay, read that CSV file in. There's a whole bunch of rows here, because this is my entire solution, but I'm just going to show you the rows I used for solution number one. So read in the whole lot.
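For readers without Excel to hand, the same user-by-package matrix view can be sketched with pandas; the toy data and column names here are invented, not the competition's actual schema:

```python
import pandas as pd

# Toy installations table standing in for the competition file.
df = pd.DataFrame({
    "user":      [1, 1, 2, 2, 3],
    "package":   ["ggplot2", "caret", "ggplot2", "zoo", "caret"],
    "installed": [1, 0, 1, 1, 1],
})

# One row per package, one column per user: the same matrix view
# the Excel pivot table gives, with NaN where a cell is unknown.
mat = df.pivot_table(index="package", columns="user", values="installed")
print(mat)
```

Scanning a matrix like this is how the widely-installed packages and the heavy-installing users jump out.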
Although user is a number, treat it as a factor, because user number one is not 50 times worse than user number 50. Those trues and falses, turn them into ones and zeros to make life a bit easier. And now apply the mean function to each user across their installations, and apply the mean function to each package across that package's installations. So now I've got a couple of lookups, basically, that tell me: user number 50 installs this percent of packages; this particular package is installed by this percent of users. And then I just stuck them basically back into my file of predictors. So I did these simple lookups for each row to find, for that row, the mean for that user and the mean for that package. And that was actually it. At that point, I then created a GLM in which, obviously, I had my ones and zeros of installations as the thing I was predicting. In my first version I had UP and PP, these two probabilities, as my predictors. In fact, no, the first version was even easier than that. All I did, in fact, was take the max of those two things. So pmax: if you're not familiar with R, it's just something that does a max on each row individually. In R, nearly everything works on vectors by default, except for max, so that's why we have to use pmax. Something well worth knowing. So I just took the max. So this user installs 30% of things, and this package is installed by 40% of users, so the max of the two is 40%. And I actually created a GLM with just one predictor. The benchmark that the dataists people created used a GLM on all of those predictors, including all kinds of latent analysis of the manual pages and maintainer names and God knows what, and they had an AUC of 0.8. This five-lines-of-code thing had an AUC of 0.95. So, you know, the message here is: don't overcomplicate things.
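As a rough Python sketch of that solution number one (the R version used per-user and per-package means plus pmax; the data here is a made-up toy, not the competition's):

```python
# Toy (user, package, installed) triples.
rows = [
    ("u1", "ggplot2", 1), ("u1", "caret", 0),
    ("u2", "ggplot2", 1), ("u2", "caret", 1),
    ("u3", "ggplot2", 0), ("u3", "caret", 0),
]

def mean_by(rows, key_idx):
    """Average install rate grouped by user (key_idx=0) or by
    package (key_idx=1): the analogue of applying mean per group."""
    totals, counts = {}, {}
    for r in rows:
        k = r[key_idx]
        totals[k] = totals.get(k, 0) + r[2]
        counts[k] = counts.get(k, 0) + 1
    return {k: totals[k] / counts[k] for k in totals}

user_rate = mean_by(rows, 0)  # how often each user installs things
pkg_rate = mean_by(rows, 1)   # how widely each package is installed

# The single predictor: row-wise max of the two rates, i.e. R's pmax.
preds = [max(user_rate[u], pkg_rate[p]) for u, p, _ in rows]
```

Feeding `preds` as the lone predictor into a logistic regression reproduces the one-predictor GLM idea.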
If people give you data, don't assume you need to use all of it, and do look at pictures. So if we have a look at my progress here: here's my first attempt, which was basically to multiply the user probability by the package probability. And you can see one of the nice things in Kaggle is you get a history of your results. So here's my 0.84, you see. And then I changed it to using the maximum of the two, and there's my 0.95 AUC. And I thought, oh, that was good. Imagine how powerful this will be when I use all that data they gave us with a fancy random forest. And it went backwards. Okay, so you can really see that a bit of focused, simple analysis can often take you a lot further. So if we look at the next page, you can kind of see where I kept thinking: random forests, they ought to bloody work, a bit more random forest. And that went backwards. And then I started adding in a few extra things. And then I thought, you know, there is one piece of data which is really useful, which is that dependency graph. If somebody has installed package A, and it depends on package B, and I know they've got package A, then I also know they've got package B. So I added that piece. That's the kind of thing I find a bit difficult to do in R, because I think R is a slightly shit programming language. So I did that piece in a language which I quite like, C#, and imported it back into R. And then, as you can see, each time I send something off to Kaggle, I generally copy and paste into my notes the line of code that I ran, so I can see exactly what it was. So here I added this dependency graph, and I jumped up to 0.98. That's basically as far as I got in this competition, which was enough for sixth place. I made a really stupid mistake. Yes, if somebody has package A and it depends on package B, then obviously that means they've got package B. I did that.
If somebody doesn't have package B, and package A depends on it, then you know they definitely don't have package A. I forgot that piece. And when I went back and put it in after the competition was over, I realized I could have come in about second if I'd just done that. So in fact, to get in the top three in this competition, that's probably as much modeling as you needed. So I think you can do well in these comps without necessarily being an R expert or a stats expert, but you do need to dig into the toolbox appropriately. So let's go back to my extensive slide presentation. You can see here we've talked about data manipulation and about interactive analysis. We've talked a bit about visualizations, and I include there even simple things like those tables we did. As I just indicated, in my toolbox is some kind of general purpose programming tool. And to me, there are three or four clear leaders in this space. And I know from speaking to people in the data science world that about half the people I speak to don't really know how to program. You definitely should, because otherwise all you can do is use stuff that other people have made for you. And I would be picking from these tools. So I like the highly misunderstood C#, and I would combine it with these particular libraries. Yes, Christian? [Question: are these complementary or competing?] Complementary, and I'll come to that in the very next bullet point. So this general purpose programming tool is for the stuff that R doesn't do that well. Even Ross Ihaka, one of the guys who wrote R, says he's not that fond nowadays of various aspects of R as an underlying language, whereas there are other languages which are just so powerful and so rich and so beautiful. I should actually have included some of the functional languages in here too; Haskell would be another great choice.
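Both dependency-graph rules, the one he used and the contrapositive he forgot, can be sketched together like this (the graph and known cells are hypothetical, and the sketch assumes the known cells are consistent with the graph):

```python
# Hypothetical dependency graph: package -> packages it depends on.
deps = {"caret": ["lattice", "ggplot2"], "ggplot2": ["rlang"]}

def apply_dependency_rules(known, deps):
    """Propagate installs through the graph using both rules:
    1. if A is installed, every dependency of A is installed;
    2. if B is NOT installed, nothing depending on B is installed.
    `known` maps (user, package) -> 0/1 for the cells we know."""
    known = dict(known)
    changed = True
    while changed:
        changed = False
        for (user, pkg), val in list(known.items()):
            if val == 1:  # rule 1: dependencies must be present
                for dep in deps.get(pkg, []):
                    if known.get((user, dep)) != 1:
                        known[(user, dep)] = 1
                        changed = True
            else:  # rule 2: dependents must be absent
                for parent, ds in deps.items():
                    if pkg in ds and known.get((user, parent)) != 0:
                        known[(user, parent)] = 0
                        changed = True
    return known

filled = apply_dependency_rules(
    {("u", "caret"): 1, ("v", "rlang"): 0}, deps)
```

Rule 1 alone got him to 0.98 and sixth place; by his account, adding rule 2 as well would have been worth roughly second.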
But if you've got a good powerful language, a good powerful matrix library, and a good powerful machine learning toolkit, you're doing great. So Python is fantastic. Python also has a really, really nice REPL. A REPL is where you type in a line of code, like in R, and it immediately gives you the result, and you can keep moving through like that. You can use IPython, which is a really fantastic REPL for Python. And in fact, the other really nice thing in Python is Matplotlib, which gives you a really nice charting library. Much less elegant, but just as effective for C# and just as free, are the MS Chart controls. I've written a kind of functional layer on top of those to make them easier to do analysis with, but they're super fast and super powerful, and that only took 10 minutes. If you use C++, that also works great. There's a really brilliant thing, very, very underutilized, called Eigen, which originally came from the KDE project and provides an amazingly powerful vector and scientific programming layer on top of C++. Java, to me, is something that used to be on a par with C# back in the .NET 1.0/1.1 days. It's looking a bit sad nowadays, but on the other hand, it has just about the most powerful general purpose machine learning library on top of it, which is Weka. So there's a lot to be said for using that combination. In the end, if you're a data scientist who doesn't yet know how to program, my message is: learn to program. And I don't think it matters too much which one you pick. I would be picking one of these, but without it, you're going to struggle to go beyond what the tools provide. Question at the back. [The question was about freely available visualization tools equivalent to SAS JMP.] Yeah, okay. I would have a look at something like GGobi.
GGobi is a fascinating free tool in the same kind of area as Tableau, which we talked about. It supports this concept of brushing, which is the idea that you can create a whole bunch of plots, scatter plots and parallel coordinate plots and all kinds of plots, and you can highlight one area of one plot, and it'll show you where those points are in all the other plots. And so in terms of really powerful visualization tools, I think GGobi would be where I would go. Having said that, it's amazing how little I use it in real life. Because things like Excel, and what I'm about to come to, which is ggplot2, although much less fancy than things like JMP and Tableau and GGobi, support a kind of hypothesis-driven problem-solving approach very well. Something else I do is I tend to create visualizations which meet my particular needs. So we talked about the time series problem, and that's one in which I used a very simple ten-line piece of JavaScript to plot every single time series in a huge big mess like this. Now, you might think, well, if you're plotting hundreds and hundreds of time series, how much insight are you really getting from that? But I found it amazing how much my brain picked up just scrolling through hundreds of time series. And then, when I started modeling this, I turned these into something a bit better, which was basically to repeat it, but this time showing both the orange, which is the actuals, and the blue, which is my predictions. And then I put in the metric of how successful each particular time series was. So I found that by using this more focused visualization development, in this case, I could immediately see whereabouts these were, which numbers were high. So here's one here, 0.1, that's a bit higher than the others.
And I could immediately see what I'd done wrong, and I could get a feel for how my modeling was going straight away. So I tend to think you don't necessarily need particularly sophisticated visualization tools. They just need to be fairly flexible, and you need to know how to drive them to give you what you need. So, you know, through this kind of visualization, I was able to check every single chart in this competition: if it wasn't matching well, I'd look at it and say, yeah, it's not matching well because there was just a shock in some period which couldn't possibly be predicted. So that's okay. And this was one of the competitions that I won, and I really think this visualization approach was key. So I mentioned I was going to come back to one really interesting plotting tool, which is ggplot2. ggplot2 was created by a particularly amazing New Zealander who seems to have more time than everybody else in the world combined and creates all these fantastic tools. Thank you, Hadley. I was going to show you what I meant by a really powerful but simple plotting tool. Here's something really fascinating, right? You know how creating scatter plots with lots and lots of data is really hard, because you end up with just big black blobs? So here's a really simple idea: why don't you give each point in the data a level of transparency, so that the more they sit on top of each other, it's like transparent disks stacking up and getting darker and darker. So in the amazing R package called ggplot2, you can do just that. Here's something that says: plot the carats of a diamond against its price, and vary what's called the alpha channel (the graphics geeks among you will know that means the level of transparency), setting the alpha for each point to 1 over 10, or 1 over 100, or 1 over 200.
And you end up with these plots which actually show you the heat, you know, the density of each area. And it's just so much better than any other approach to scatter plots I've ever seen. So simple, and just one little line of code in ggplot2. And this, by the way, is in a completely free chapter of the book that he's got up on his website. It's a fantastic book; you should definitely buy it. It's by the author of the package, about ggplot2, and this one most important chapter is available free on his website, so check it out. I'll show you another couple of examples where everything's done just right. Here's a simple approach: plotting a loess smoother through a bunch of data, always handy. But every time you plot something like that, you should see the confidence intervals. No problem: it does that by default. The kind of fit you normally want to see is a loess smoother, so if you ask for a fit, it gives you the loess smoother by default, and it gives you the confidence interval by default. So it makes it hard to create really bad graphs in ggplot2, although some people have managed, I've noticed. Things like box plots, all stacked up next to each other: it's such an easy way of seeing, in this case, how the price of diamonds varies with color. They've all got roughly the same median, but some of them have really long tails in their prices. What a really powerful plotting device. And it's so well done that in this chapter of the book, he shows a few options. Here's what would happen if you used a jitter approach, and he's got another one down here, which is what would happen if you used that alpha transparency approach, so you can really compare the different approaches. So ggplot2 is something which, and I'll scroll through these so you can see what kind of stuff you can do, is a really important part of the toolbox. Here's another one I love, right? Okay, so we do lots of scatter plots.
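The alpha-transparency trick isn't specific to ggplot2; here's the same idea in Matplotlib, which the talk mentions earlier. The data is synthetic, standing in for the diamonds set:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt
import numpy as np

# Fake carat/price-style data: many overlapping points.
rng = np.random.default_rng(0)
x = rng.gamma(2.0, 0.4, 50_000)
y = x * 4000 + rng.normal(0, 1500, 50_000)

# alpha = 1/100 is the trick: overlapping translucent points stack
# up into darker regions, turning a black blob into a density view.
fig, ax = plt.subplots()
ax.scatter(x, y, s=4, alpha=1 / 100, edgecolors="none")
ax.set_xlabel("carat")
ax.set_ylabel("price")
fig.savefig("scatter.png")
```

Try alpha values of 1/10, 1/100, and 1/200 as in the talk; the right value depends on how many points overlap.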
And scatter plots are really powerful. And sometimes, if the points are ordered chronologically, you actually want to see how they changed over time. So one way to do that is to connect them up with a line. Pretty bloody hard to read. So we take this exact thing, but just add one simple thing: set the color to be related to the year of the date. And then, bang. Now you can see, by following the color, exactly how this is sorted. And so you can see here's one end here, and here's the other end here. So ggplot2 again has done fantastic things to make us understand this data more easily. One other thing I will mention is caret. How many people here have used the caret package? So I'm not going to show you caret, but I will tell you this. If you go into R and you type some model equals train on my data, create an SVM, this is what it looks like. It's got a command called train, and you can pass in a string; there are, I think, about 300 different possible models, classification and regression models. And then you can add various things in here saying: I want you to center the data first, please, and do a PCA on it first, please. And it just puts all of the pieces together. It can do things like remove columns from the data which hardly vary at all and are therefore useless for modeling; you can do that automatically. It can automatically remove columns in the data that are highly collinear. But most powerfully, it's got this wrapper that basically lets you take hundreds of R's most powerful algorithms, many really hard to use, and drive them all through one command. And here's the cool bit, right? Imagine we're doing an SVM. I don't know how many of you have tried to use SVMs, but it's really hard to get a good result because they depend so much on the parameters. With this version, it automatically does a grid search to find the best parameters.
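caret's train() has a rough Python analogue in scikit-learn, which the talk doesn't mention but which shows the same pattern: preprocessing (center/scale, then PCA) chained with an automatic grid search over the SVM's parameters. The dataset and parameter grid here are illustrative choices:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Center/scale, then PCA, then SVM: the preProcess steps plus model.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("svm", SVC()),
])

# Grid search over C and gamma: the tuning caret automates for you.
search = GridSearchCV(
    pipe,
    {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.1]},
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

The point, as with caret, is that the parameter search and the preprocessing travel together in one object, so cross-validation can't accidentally leak information from the test folds.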
So you just create one command, and it does the SVM for you. So you definitely should be using caret. There's one more thing in the toolbox I wanted to mention, which is that you need to use some kind of version control tool. How many people here have used a version control tool like Git, CVS, SVN? So let me give you an example from our terrific designer at Kaggle. He's recently been changing some of the HTML on our site, and he checked it into this version control tool that we use. And it's just so nice, right? Because I can go back to any file now and see exactly what was changed and when. And then I can go through and say: okay, I remember that thing broke at about this time. What changed? Oh, I think it was this file here. Okay, that line was deleted, this line was changed, this section of this line was changed. Okay. And you can see, with my version control tool, it's keeping track of everything I do. Can you see how powerful this is for modeling? Because you go back through your submission history at Kaggle and you say: oh, shit, I used to be getting 0.97 AUC, now I'm getting 0.93. I'm sure I'm doing everything the same. Go back into your version control tool and have a look at the history, the commits list. And you can go back to the date where Kaggle shows you that you had a really shit-hot result and you can't now remember how the hell you did it. And you go back to that date, and you go: oh yeah, it's this one here. And you go and have a look and see what changed. And it can do all kinds of cool stuff: you can merge back in results from earlier commits, or you can undo the changes you made between two dates, and so on and so forth. And most important, at the end of the competition, when you win and Anthony sends you an email saying, fantastic, send us your winning model, and you go: oh, I don't have the winning model anymore. No problem.
You can go back into your version control tool and ask for everything as it was on the day you got that fantastic answer. So there's my toolkit. There's quite a lot of other things I wanted to show you, but I don't have time. So what I'm going to do is jump to this interesting one, which was about predicting which grants would be successful or unsuccessful at the University of Melbourne, based on data about the people involved in the grant and all kinds of metadata about the application. This one's interesting because I won it by a fair margin; the 0.967 represents something like 25% of the available error over the next competitor. It's interesting to think: what did I do right this time, and how did I set this up? Basically, what I did was use a random forest. So I'm going to show you guys a bit about random forests. What's also interesting is that I didn't use R at all. That's not to say that R couldn't have come up with a pretty interesting answer. The guy who came second in this comp used SAS, but I think he used something like 12 gig of RAM on a multi-core, huge machine. Mine ran on my laptop in two seconds. So I'll show you an approach which is very efficient as well as very powerful. So I did this all in C#, and the reason I didn't use R for this is because the data was kind of complex. Each grant had a whole bunch of people attached to it, and it was provided in a denormalised form. I don't know how many of you are familiar with normalisation strategies, but denormalised form basically means you had a whole bunch of information about the grant, the dates and blah, blah, blah, and then there was a whole bunch of columns about person one (did they have a PhD?), and then a whole bunch of columns about person two, and so forth, for, I think, about 13 people. It's very, very difficult to model this extremely wide and extremely messy data set. It's the kind of thing that general purpose computing tools are pretty good at.
So I pulled this into C# and created a grants data class, where basically I went, okay, read through this file, and I created this thing called grants data, and for each line I split it on a comma and added that grant to this grants data. For those people who maybe aren't so familiar with general purpose programming languages, you might be surprised to see how readable they are. This idea that I can say: for each something in lines, select the line split by comma. If you haven't used anything like this before, you might be surprised that something like C# is just a File.ReadLines dot Skip some lines; this is just to skip the first line, the header. And in fact, later on I discovered the first couple of years of data were not very predictive of today, so I actually skipped all of those. And the other nice thing about these kinds of tools is, okay, what does this .Add do? I can whack one button and bang, I'm into the definition of .Add. These kinds of IDE features are really helpful, and this is equally true of most Python and Java and C++ editors as well. So the kind of stuff I was able to do here was to create all kinds of interesting derived variables, like here's one called max year birth. So this one goes through all of the people on this application and finds the one with the largest year of birth. Again, it's just a single line of code. If you can get past the curly brackets and things like that, the actual logic is extremely easy to understand, you know, things like: do any of them have a PhD? Well, if there are no people on it, none of them do. Otherwise, oh, this is just one person has a PhD; down here somewhere I can find it. AnyHasPhd. Bang, straight to there. There you go. Does any person have a PhD? So I create all these different derived fields, and I used pivot tables to work out which ones seemed to be quite predictive before I put these together.
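The derived variables he describes are one-liners over the list of people on each grant. Here's a sketch in Python rather than his original C# with LINQ (the class and field names are mine, not his):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Person:
    year_of_birth: Optional[int] = None
    has_phd: bool = False

@dataclass
class Grant:
    people: List[Person] = field(default_factory=list)

    def max_year_birth(self) -> Optional[int]:
        # largest year of birth across all the people on this application
        years = [p.year_of_birth for p in self.people if p.year_of_birth is not None]
        return max(years) if years else None

    def any_has_phd(self) -> bool:
        # if there are no people on it, none of them have a PhD
        return any(p.has_phd for p in self.people)

grant = Grant([Person(1961, False), Person(1975, True)])
print(grant.max_year_birth())  # 1975
print(grant.any_has_phd())     # True
```

Each derived field is then just one more column you can drop into a pivot table to eyeball its predictiveness.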
And so what did I do with this? Well, I wanted to create a random forest from this. Now, random forests are a very powerful, very general purpose tool, but the R implementation of them has some pretty nasty limitations. For example, if you have a categorical variable, in other words a factor, it can't have any more than 32 levels. If you have a continuous variable, so like an integer or a double or whatever, it can't have any nulls. So there are these kinds of nasty limitations that make it quite difficult. It's particularly difficult to use in this case because things like the RFCD codes had hundreds and hundreds of levels, and all the continuous variables were full of nulls. And in fact, if I remember correctly, even the factors aren't allowed to have nulls, which I find a bit weird, because to me null is just another factor, right? They're male, or they're female, or they're unknown; it's still something I should be able to model on. So I created a system that basically made it easy for me to create a data set to model on. I decided that for doubles that had nulls in them, I'd simply add two columns. One column was: is that column null or not? One or zero. And another column had the actual data from that column, so whatever it was, 2.3, then that one's a null, then 6, blah, blah, blah. And wherever there was a null, I just replaced it with the median. So I now had two columns where I used to have one, and both of them are now modelable. Why the median? Actually, it doesn't matter, because every place where this is the median, there's a one over here. So in my model, I'm going to use this as a predictor, so if all the places where that data column was originally null meant something interesting, then it'll be picked up by the is-null version of the column.
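That null-handling trick, one indicator column plus one median-imputed column, can be sketched in a few lines of Python (the function name is mine, not his):

```python
import statistics

def expand_null_column(values):
    """Turn one numeric column containing nulls (None) into two modelable
    columns: an is-null indicator, and a median-imputed copy of the data."""
    present = [v for v in values if v is not None]
    med = statistics.median(present)
    is_null = [1 if v is None else 0 for v in values]
    imputed = [med if v is None else v for v in values]
    return is_null, imputed

# e.g. 2.3, then a null, then 6: the null becomes the median of [2.3, 6]
print(expand_null_column([2.3, None, 6]))  # ([0, 1, 0], [2.3, 4.15, 6])
```

Because every imputed cell lines up with a 1 in the indicator column, the model can still learn whatever the missingness itself means.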
So to me, this is something I did automatically, because it's clearly the obvious way to deal with null values. And then, as I said, in the categorical variables, the factors, if there's a null, I just treated it as another level. And then finally, in the factors, I said: okay, take all of the levels, and if a level has more observations than some threshold, I think it was 100, then keep it. If it has between 25 and 100 observations and it was quite predictive, in other words that level was different to the others in terms of application success, then keep it. Otherwise, merge all the rest into one super level called "the rest". So that way I was basically able to create a data set which I could then actually feed to R, although I think in this case I ended up using my own random forest implementation. So should we have a quick talk about random forests and how they work? To me, there are basically two main types of model. There are parametric models, models with parameters, things where you say: oh, this bit's linear, and this bit interacts with this bit, and this bit's kind of logarithmic, and I specify how I think the system looks, and all the modeling tool does is fill in the parameters. Okay, this is the slope of that linear bit; this is the slope of that logarithmic bit; this is how these two interact. So things like GLMs are very well-known parametric tools. Then there are non-parametric or semi-parametric models, which are things where I don't do any of that. I just say: here's my data, I don't know how it's related to each other, just build a model. And things like support vector machines, neural nets, random forests and decision trees all have that kind of flexibility.
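A sketch of that level-merging rule in Python. The thresholds (100 and 25) are the ones he half-remembers, and the "quite predictive" test here is my guess at the spirit of it (a gap in success rate versus the overall rate), not his actual code:

```python
from collections import defaultdict

def merge_rare_levels(levels, successes, keep_at=100, maybe_at=25, min_gap=0.1):
    """Keep big levels; keep middling levels only if their success rate
    differs enough from the overall rate; lump everything else into 'rest'."""
    count = defaultdict(int)
    wins = defaultdict(int)
    for lev, s in zip(levels, successes):
        count[lev] += 1
        wins[lev] += s
    overall = sum(successes) / len(successes)

    def keep(lev):
        n = count[lev]
        if n >= keep_at:
            return True                                     # plenty of data: keep
        if n >= maybe_at:
            return abs(wins[lev] / n - overall) >= min_gap  # "quite predictive"
        return False                                        # too rare: merge

    return [lev if keep(lev) else "rest" for lev in levels]
```

After this pass, even a hundreds-of-levels factor like the RFCD codes collapses to something a 32-level-limited tool could digest.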
Non-parametric models are not necessarily better than parametric models. Think back to that example of the package competition, where really all I wanted was some weights to say how this max column relates; all you really wanted was some parameters, and so a GLM is perfect. GLMs certainly can overfit, but there are ways of creating GLMs that don't. For example, you can use stepwise regression, or the much fancier modern version, glmnet, which is basically another tool for doing GLMs which doesn't overfit. But any time you don't really know what the model form is, that's where you use a non-parametric tool. And random forests are great because they're super, super fast and extremely flexible, and they don't really have any parameters you have to tune, so they're pretty hard to get wrong. So let me show you how that works. A random forest is simply... in fact, we shouldn't even use the term random forest, because Random Forest is a trademarked term. So we'll call it an ensemble of decision trees. And in fact the trademarked Random Forest, I think that was 2001, wasn't where this ensemble-of-decision-trees idea was invented; that goes all the way back to 1995. In fact it was kind of independently developed by three different people in '95, '96 and '97. The random forest implementation is really just one way of doing it. It all rests on a really fascinating observation, which is that if you have a model that is really, really, really shit, but it's slightly better than nothing, and you've got 10,000 of these models that are all different to each other, all shit in different ways but all better than nothing, then the average of those 10,000 models will actually be fantastically powerful as a model in its own right. So this is the wisdom of crowds, or ensemble learning techniques.
You can kind of see why, right? Out of these 10,000 models, they're all kind of crap in different ways, they're all a bit random, but they're all a bit better than nothing. 9,999 of them might basically be useless, but one of them just happened upon the true structure of the data. So the other 9,999 will kind of average out; if they're unbiased and not correlated with each other, they'll all average out to whatever the average of the data is. So any difference in the predictions of this ensemble will come down to that one model which happened to have actually figured it out, right? Now, that's an extreme version, but that's basically the concept behind all these ensemble techniques. And if you want to invent your own ensemble technique, all you have to do is come up with some learner, some underlying model, which you can randomize in some way so that each one is a bit different, and you run it lots of times. Generally speaking, this whole approach is called random subspace techniques. So let me show you how unbelievably easy this is. Take any modeling algorithm you like. Here's our data: here are all the rows, here are all the columns. Okay, I'm now going to create a random subspace: some of the columns, some of the rows. So let's build a model using that subset of rows and that subset of columns. It's not going to be as good at fitting the training data as using the full data set, but you know, it's one way of building a model. Then I'll build a second model; this time I'll use this subspace, a different set of rows and columns. (No, absolutely not, but I didn't want to draw 4,000 lines, so let's pretend.) So in fact, what I'm really doing each time here is pulling out a bunch of random rows and a bunch of random columns. Correct, and this is a random subspace. It's just one way of creating a random subspace, but it's a nice easy one, and because I didn't do very well at linear algebra, in fact I'm just a philosophy graduate, I don't know
any linear algebra; I don't know what subspace means well enough to do it properly, but this certainly works, and this is all decision trees do. So now imagine we're going to do this, and for each one of these different random subspaces we're going to build a decision tree. How do we build a decision tree? Easy. Let's create some data. Let's say we've got age, sex, is-smoker, and lung capacity, and we want to predict people's lung capacity. So we've got a whole bunch of data there: thirty, male, yes, whatever. Now assume this is the particular subset of columns and rows in a random subspace, and let's build a decision tree. To build a decision tree, what I do is say: okay, on which predictor, and at which point of that predictor, can I do a single split which makes the biggest difference possible in my dependent variable? So it might turn out that if I looked at is-smoker, yes and no, the average lung capacity for all of the smokers might be 30, and the average for all of the non-smokers might be 70. So literally all I've done is gone through each possible split, calculated the average of the dependent variable for the two groups, and found the one split that makes that difference as big as possible. And then I keep doing that. So within those people that are non-smokers... now, interestingly, with the random forest, or these decision tree ensemble algorithms generally speaking, at each point I select a different group of columns. So I randomly select a new group of columns, but I'm going to use the same rows; I obviously have to use the same rows, because I'm kind of taking them down the tree. So now it turns out that if we look at age amongst the people that are non-smokers, less than 18 versus greater than 18 is the number one biggest thing in this random subspace that makes a difference, and that's like 50 and that's like 80. And so this is how I create a decision tree. So at each point I've taken a different random subset of columns; for
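The greedy split step he describes (find the split with the biggest gap in the mean of the dependent variable) can be sketched in Python; the column names and toy numbers below are just the smoker example restated:

```python
def best_split(rows, target, columns):
    """Find the (column, cutpoint) whose split produces the biggest
    difference in the mean of the target between the two groups."""
    best = None  # (gap, column, cutpoint)
    for col in columns:
        for cut in sorted({r[col] for r in rows})[:-1]:  # candidate breakpoints
            left = [t for r, t in zip(rows, target) if r[col] <= cut]
            right = [t for r, t in zip(rows, target) if r[col] > cut]
            gap = abs(sum(left) / len(left) - sum(right) / len(right))
            if best is None or gap > best[0]:
                best = (gap, col, cut)
    return best

rows = [{"age": 40, "smoker": 1}, {"age": 50, "smoker": 1},
        {"age": 20, "smoker": 0}, {"age": 60, "smoker": 0}]
lung_capacity = [30, 30, 70, 70]
print(best_split(rows, lung_capacity, ["age", "smoker"]))
# splits on "smoker": smokers average 30, non-smokers average 70
```

A full tree just applies this recursively, resampling the candidate columns at each node.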
the whole tree I've used the same random subset of rows. And at the end of that, I keep going until every one of my leaves either has only one or two data points left, or all of the data points at that leaf have exactly the same outcome, the same lung capacity for example, and at that point I've finished making my decision tree. So now I put that aside and say: okay, that is decision tree number one. Put that aside, go back, take a different set of rows, and repeat the whole process, and that gives me decision tree number two. And I do that a thousand times, whatever, and at the end of that I've now got a thousand decision trees. And for each thing I want to predict, I stick it down every one of these decision trees. So the first thing I'm trying to predict might be a non-smoker who's 16 years old, and each tree gives me a prediction: the prediction at the leaf at the very bottom is simply the average of the dependent variable, the lung capacity, for that group. So that gives me 50 in decision tree one, and it might be 30 in decision tree two, and 14 in decision tree three, and I just take the average of all of those. And that's given me what I wanted, which is a whole bunch of independent, unbiased, not-completely-crap models. How not-completely-crap are they? Well, the nice thing is we can pick, right? If you want to be super cautious and you really need to make sure you avoid overfitting, then you make your random subspaces smaller: you pick fewer rows and fewer columns, so that each tree is shitter than average. Whereas if you want to be quick, you make each one have more rows and more columns, so it better reflects the true data that you've got. Obviously, the fewer rows and columns you use each time, the less powerful each tree is, and therefore the more trees you need. And the nice thing about this is that building each of these trees takes like a ten-thousandth of a second or so. It depends on how much
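To make the whole randomize-build-average loop concrete, here's a deliberately tiny Python version where each "tree" is a single split (a stump) grown on a random subspace of rows and columns. A real implementation keeps splitting recursively, but the structure is the same:

```python
import random
import statistics

def fit_stump(rows, ys, cols):
    """One greedy split on the given columns: (col, cut, left_mean, right_mean)."""
    best = None
    for col in cols:
        for cut in sorted({r[col] for r in rows})[:-1]:
            left = [y for r, y in zip(rows, ys) if r[col] <= cut]
            right = [y for r, y in zip(rows, ys) if r[col] > cut]
            gap = abs(statistics.mean(left) - statistics.mean(right))
            if best is None or gap > best[0]:
                best = (gap, col, cut, statistics.mean(left), statistics.mean(right))
    if best is None:  # every column constant in this subsample: predict the mean
        m = statistics.mean(ys)
        return (cols[0], float("inf"), m, m)
    return best[1:]

def fit_forest(rows, ys, n_trees=300, seed=0):
    rng = random.Random(seed)
    all_cols = list(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = rng.sample(range(len(rows)), k=max(2, len(rows) // 2))    # random rows
        cols = rng.sample(all_cols, k=max(1, len(all_cols) // 2))       # random columns
        forest.append(fit_stump([rows[i] for i in idx], [ys[i] for i in idx], cols))
    return forest

def predict(forest, row):
    # average the predictions of every (crap, but better than nothing) stump
    return statistics.mean(lm if row[c] <= cut else rm for c, cut, lm, rm in forest)
```

On lung-capacity-style data where smoking drives the outcome, the smoker-based stumps predict 70 versus 30 and the irrelevant age-based ones average out, which is exactly the wisdom-of-crowds effect he describes.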
data you've got, but you can build thousands of trees in a few seconds for the kinds of data sets I look at, so generally speaking this isn't an issue. And here's a really cool thing. This tree here, I built it with these rows, which means these other rows I didn't use to build my tree, which means those rows are out of sample for that tree. And what that means is I don't need a separate cross-validation data set. It means I can create a table now of my full data set, and for each row I can say: okay, row number one, how good am I at predicting row number one? Well, here are all of my trees, from one to a thousand. Row number one was in fact one of the things included when I created tree number one, so I won't use it there. But row number one wasn't included when I built tree two, and it wasn't included in the random subspace for tree three, and it was included in the one before. So what I do is send row number one down trees two and three, get the predictions from every tree it wasn't in, and average them out. And that gives me this fantastic thing, which is an out-of-bag estimate for row one. And I do that for every row. So all of this stuff being predicted here is not using any of the data that was used to build the trees, and therefore it is truly out of sample. And therefore, when I put this all together to create my final whatever it is, AUC or log likelihood or R² or SSE or whatever, and I send that off to Kaggle, Kaggle should give me pretty much the same answer, because you are by definition not overfitting. Yes? I'm just wondering, is it possible to just pick the one tree that actually has the best performance, and would you recommend it?
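Here's the out-of-bag bookkeeping on its own, in Python. To keep it short, each "tree" is reduced to the silliest possible model, the mean of its row subsample; the point is only that each row is scored exclusively by trees whose subsample did not include it:

```python
import random
import statistics

def oob_estimates(ys, n_trees=500, seed=1):
    """For each row, average the predictions of only those 'trees'
    that never saw that row during training (out-of-bag scoring)."""
    rng = random.Random(seed)
    n = len(ys)
    trees = []  # (set of training row indices, the tree's prediction)
    for _ in range(n_trees):
        idx = set(rng.sample(range(n), k=n // 2))   # this tree's random rows
        trees.append((idx, statistics.mean(ys[i] for i in idx)))
    return [statistics.mean(pred for idx, pred in trees if i not in idx)
            for i in range(n)]
```

The list this returns is a genuinely out-of-sample prediction per row, so an AUC or SSE computed from it needs no held-out set.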
Yes, you can; no, I wouldn't. So the question was: can you just pick one tree, and would picking that one tree be better than what we've just done? That's a really important question, so let's think about why it won't work. The whole purpose of this was to not overfit. The whole purpose was to say: each of these trees is pretty crap, but it's better than nothing, and so when we average them all out, it tells us something about the true data, and each one can't overfit on its own. If I try and prune them, which is what the old-fashioned decision tree algorithms do, or if I weight them, or if I pick a subset of them, I'm now introducing bias based on the training set. So any time I introduce bias, I fundamentally break the laws of ensemble methods. The other thing I'd say is there's no point, because if you've actually got so much training data that out-of-sample error isn't a big problem or whatever, you just use bigger subspaces and fewer trees. And in fact the only reason you'd do that is for time, and because this approach is so fast anyway, I wouldn't even bother. And the nice thing about this is that you can say: okay, I'm going to use this many columns and this many rows in each subspace, and I start building my trees, and I build tree number one and get this out-of-bag error, tree number two, this out-of-bag error, tree number three, this out-of-bag error. And the nice thing is I can watch it, and it will be monotonic (well, not exactly monotonic, but kind of bumpily monotonic); it will keep getting better on average, and I can get to a point where I say: okay, that's good enough, stop. But as I say, normally it's taking four or five seconds, so time's just not an issue. But if you're talking about huge data sets that you can't sample down, this is a way you can watch it go. So this is the technique I used in the grants prediction competition. I did a bunch of things
to make it even more random than this. One of the big problems here, both in terms of time and lack of randomness, is that for all of these continuous variables, the official random forest algorithm searches through every possible breakpoint to find the very best one. Which means that every single time it uses that particular variable, particularly if it's in the same spot, like at the top of the tree, it's going to do the same split, right? In the version I wrote, all it does is, every time it comes across a continuous variable, it randomly picks three breakpoints. So it might try 50, 70 and 19, and it just finds the best of those three. And to me, this is the secret of good ensemble algorithms: make every tree as different from every other tree as possible. Does the distribution of the variable you're trying to predict matter? No, not at all. So the question was: does the distribution of the dependent variable matter? And the answer is it doesn't, and the reason is because we're using a tree. The nice thing about a tree is, let's imagine the dependent variable had a very long-tailed distribution, like so. As the tree looks at the independent variables, it's looking at the difference between two groups and trying to find the biggest difference between those two groups. So regardless of the distribution, it's more like a rank measure, isn't it? It's picked a particular breakpoint, and it's saying which one finds the biggest difference between the two groups. So regardless of the distribution of the dependent variable, it's still going to find the same breakpoints, because it's really a non-parametric measure; we're using something like, for example, Gini, or some other measure of information gain, to build the decision tree. And this is true of really all decision tree approaches. Does it work with highly imbalanced data? Yes and no. So the question is: does it work for a highly imbalanced data set? Sometimes. Some versions can and some versions
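The three-random-breakpoints idea is a one-function change. A sketch (his actual implementation was in C#; the function and parameter names here are mine):

```python
import random

def pick_random_cut(values, ys, n_candidates=3, rng=None):
    """Instead of scanning every possible breakpoint of a continuous
    variable, try a few random ones and keep whichever gives the biggest
    gap in the mean of the target, so each tree splits a bit differently."""
    rng = rng or random.Random()
    cuts = sorted(set(values))[:-1]                       # every legal breakpoint
    tried = rng.sample(cuts, k=min(n_candidates, len(cuts)))

    def gap(cut):
        left = [y for v, y in zip(values, ys) if v <= cut]
        right = [y for v, y in zip(values, ys) if v > cut]
        return abs(sum(left) / len(left) - sum(right) / len(right))

    return max(tried, key=gap)
```

Two trees calling this on the same column will usually end up with different splits, which is exactly the extra decorrelation he's after, and it's much cheaper than the exhaustive scan.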
can't. The approaches which use more randomization are more likely to work okay. But the problem is, in highly imbalanced data sets, you can quite quickly end up with nodes which are all the same value. So I've actually often found I get better results if I do some stratified sampling. For example, think about the R competition, where most users haven't installed 99% of the packages. So in that case I tend to say: all right, at least half of that data set is so obviously zero, let's just call it zero, and work with the rest. And I do find I often get better answers, but it does depend. Well, you can't call it a forest if you use an algorithm other than a tree, but yes, you can use other random subspace methods, you absolutely can; a lot of people have been going down that path. It would have to be fast, so glmnet would be a good example, because that's very fast, but glmnet is parametric. The nice thing about decision trees is that they're totally flexible: they don't assume any particular data structure, they're almost unlimited in the amount of interactions they can handle, and you can build thousands of them very quickly. But there are certainly people creating other types of random subspace ensemble methods, and I believe some of them are quite effective. Interestingly, I can't remember where I saw it, but I have seen some papers which show evidence that, if you've got a truly flexible underlying model and you make it random enough and you create enough of them, it doesn't really matter which one you use or how you do it. Which is a nice result, because then I guess we don't have to spend lots of time trying to come up with better and better generic predictive modeling tools. If you think about it, there are "better" versions of this, in quotes, like rotation forests, and there are things like GBMs, gradient boosting machines, and so forth. In practice they can be faster for certain
types of situations, but the general result here is that these ensemble methods are as flexible as you need them to be. At the back: how do you define the optimal size of the subspace? So the question is how do you define the optimal size of the subspace, and that's a really terrific question. The answer is really nice, and it's that you really don't have to. Generally speaking, the fewer rows and columns you use, the more trees you need, but the less you'll overfit and the better the results you'll get. The nice thing, normally, is that for most data sets, because of the speed of random forests, you can pretty much always pick a row count and a column count that's small enough that you're absolutely sure it's going to be fine. Sometimes it can become an issue: maybe you've got really huge data sets, or really big problems with data imbalance, or hardly any training data. In those cases you can use the kinds of approaches that would be familiar to most of us: create a grid of a few different values of the column count and the row count, try a few out, and watch that graph of how it improves as you add more trees. But the truth is, it's so insensitive to this that if you pick a number of columns somewhere between, I don't know, 10% and 50% of the total, and a number of rows between 10% and 50% of the total, you'll be fine. And then you just keep adding more trees until you're sick of waiting, or it's obviously flat, or, you know, you've done a thousand trees. It really doesn't matter; it's not sensitive to that assumption on the whole. Yeah, the R routine, actually... so, this idea of a random subspace: there are different ways of creating the random subspace, and one key one is, can I pull out a row again that I've already pulled out before? R's randomForest, and the original code it's based on, by default let you pull something out multiple times, and by default, in fact, if
you've got n rows, it will pull out n rows. But because it's pulled some out multiple times, on average it will cover 63.2% of the rows. I don't get the best results when I use that. But it doesn't matter, because in randomForest's options you can choose whether it samples with or without replacement, and how many rows to take. Does it make a difference? I absolutely think it makes a difference. Yes. I'm sure it depends on the data set, but I guess I always enter Kaggle competitions in areas that I've never entered before, domain-wise or algorithm-wise, so I guess I'm getting a good spread of different types of situations, and in the ones I've looked at, sampling without replacement is kind of more random, and I also tend to pick a much lower n than 63.2%; I tend to use more like 10% or 20% of the data in my random forests. Yeah, I know. I guess I can say that's my experience, but I'm sure it depends on the data set, and I'm not sure it's terribly sensitive to it anyway. Do I always split into two branches? There are a few possibilities here as you split in your decision tree. In this case I've got something here which is a binary variable, so obviously that has to be split into two. In this case I've got something here which is a continuous variable. Now, I've split it into two, but if it's actually going to be optimal to split it into three, then if the variable appears again at the next level, it can always split into another two at that other split point. Can I use the same variable again? I absolutely can. It just depends: remember, at every level I repeat the sampling of a different bunch of columns, and I could absolutely have the same column in that group, and it could so happen that again I find a split point which is the best in that group. If you're doing 10,000 trees with 100 levels each, it's going to happen lots of times. So the nice thing is that if the true underlying system is a single, univariate logarithmic relationship, these trees will absolutely find that
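That 63.2% figure isn't magic: if you draw n rows with replacement from n rows, each row is missed with probability (1 - 1/n)^n, which tends to 1/e, so about 1 - 1/e ≈ 63.2% of distinct rows get touched. A quick check in Python:

```python
import random

def bootstrap_coverage(n=100_000, seed=0):
    """Draw n rows with replacement from n rows and report the
    fraction of distinct rows that appeared at least once."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n) for _ in range(n)}
    return len(drawn) / n

print(round(bootstrap_coverage(), 3))  # close to 1 - 1/e, i.e. about 0.632
```

The rows that never get drawn are exactly the out-of-bag rows used for the free error estimate earlier.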
eventually. Definitely don't prune the trees: if you prune the trees, you will introduce bias. So the key thing here, which makes this so fast and so easy but also so powerful, is that you don't prune the trees. No, it doesn't necessarily, because your split point will be such that the two halves will not necessarily be balanced in count. Yeah, that's right: in the under-18 group you could have not that many people, and in the over-18 group you could have quite a lot of people, so the weighted average of the two will come to 70. What about gradient boosting machines? Yeah, absolutely, I have. Gradient boosting machines are interesting; they're a lot harder to understand. I mean, they're still basically an ensemble technique, but they work with the residuals of previous models. There are a few pieces of theory around gradient boosting machines which are nicer than random forests: they ought to be faster, and they ought to be more well-directed, and you can do things like say, with a gradient boosting machine, that this particular column has a monotonic relationship with the dependent variable. So you can actually add constraints, which you can't do with random forests. In my experience, I haven't needed the extra speed of GBMs, because I've just never found it necessary, and I find them harder: they've got more parameters to deal with, so I haven't found them useful for me. I know that in a lot of data mining competitions, and also a lot of real-world predictive modeling problems, people try both and end up with random forests or similar algorithms. Well, we're probably just about out of time, so maybe if there are any more questions I can chat to you guys afterwards. Thanks very much.