In this video I want to show you how easy it is to do machine learning in Mathematica. What I've set up here is just a file that I'm going to import. I'm setting my notebook directory as the active directory because my machine learning file, which is a comma-separated values file, lives right inside the same folder as the notebook. Let me show you what this file is all about. There we go. We see that we've got 1, 2, 3, 4, 5, 6, 7, 8 predictor variables and an outcome, all living inside of a data set.

Now, just to tell you what this data set is all about: it is eight clinical features, and I'm predicting whether a patient has appendicitis or not. Some of you might recognize this as the modified Alvarado scoring system, where we look at migration of pain, anorexia, nausea and vomiting, right lower quadrant pain, rebound tenderness, heart rate, temperature and white cell count. We can see these last ones have been normalized, and we use those eight predictors to predict whether a patient has appendicitis. So we're going to go through two machine learning algorithms just to predict this outcome.

First of all, let's just look at how many patients are in our data set. We see there are 200. Now I want to divide this into a training data set, which I'll use to train my models, and a test data set, so that I can check the accuracy of my models. So there are 200 values and I want to select 80% of these at random. I'm going to use this computer variable, which I call eighty percent, and this is going to be 0.8 times the length of the data. It's 200, so that was always going to be quite easy: I'm going to select 160 of these, which means we're going to have 40 inside of the test set. But I want to do it at random. I don't want to select the first 160 as my training set and then the last 40 as my test set. Now the way that I'm going to show you here to select 160 rows at random is a bit laborious.
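As a rough sketch of the setup described so far, the import and the 80% calculation might look like this in the Wolfram Language. The file name `appendicitis.csv` and the variable names are assumptions for illustration only; the video does not give them, and it may import the file in a different form.

```wolfram
(* Assumption: the CSV is called "appendicitis.csv" (hypothetical name)
   and sits in the same folder as the notebook. *)
SetDirectory[NotebookDirectory[]];

(* Import returns a list of rows; if the file has a header row,
   drop it with Rest[data] before counting patients. *)
data = Import["appendicitis.csv"];

Length[data]                              (* number of rows in the file *)

(* 80% of the rows go to training, the remainder to testing *)
eightyPercent = Round[0.8 Length[data]]
```

With 200 patients this gives 160 training rows and leaves 40 for the test set.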
I'm going to stick to it, though, just to show you the thought process here. All I want to do here is create a table that's just one list, from the first patient to the last patient, 200. I'm going to use this as the address for the rows in my data set. From these I want to extract 160 at random, and I don't want any repeats. I don't want to select patient 88 twice. So random rows is going to be my computer variable, and I want to select a random sample: the range is 1 to 200, choose 160 out of that for me. I could have just put 200 there and 160 there, but I'll leave these computer variables just to show you what it's all about. So I've got 160 of these values here, chosen totally at random.

Now, random rows select. I'm going to just execute that so you can see the difference. All I'm going to do is go from a list to a list of lists. I need to do that, and I'll show you why in a moment. I'm just using Table on random rows and iterating from 1 to 160, so that I can get all of these into a list of lists and each one of these becomes its own little list. The reason why I want to do that is because I want all the row values that were not selected. Of those 200 there were 40 that were not selected, and I need them. I'm going to use the Delete command for that. I have to pass all rows, the list, and from that I remove these, but I must pass the list of lists, because I'm referencing their positions, not their actual values. That's why I had to create this list of lists. If I do that, I'm left with the other 40 that were not selected at random, which now means I can build my training data set and my test data set. So my training set is going to be data, where I reference those 160 randomly selected rows, and then the test set is the other 40. And there we go, I've got my training data set and my test data set.
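The split just described can be sketched roughly as follows. The variable names are my own spellings of the spoken ones, and the `data` table here is a synthetic stand-in (random numbers, not the appendicitis file) so that the snippet runs on its own.

```wolfram
(* Synthetic stand-in for the imported table: 200 rows, 8 predictors
   plus a 0/1 outcome in column 9. Placeholder data, not the video's file. *)
SeedRandom[1];
data = Table[Append[RandomReal[1, 8], RandomInteger[1]], {200}];

eightyPercent = Round[0.8 Length[data]];
allRows = Range[Length[data]];              (* {1, 2, ..., 200} *)

(* RandomSample draws without replacement, so no row is picked twice *)
randomRows = RandomSample[allRows, eightyPercent];

(* Delete takes positions as a list of lists: {{i1}, {i2}, ...} *)
randomRowsSelect = Table[{randomRows[[n]]}, {n, Length[randomRows]}];
testRows = Delete[allRows, randomRowsSelect];   (* rows that were not sampled *)

dataTrain = data[[randomRows]];
dataTest = data[[testRows]];

Length /@ {dataTrain, dataTest}             (* {160, 40} *)
```

Because `allRows` is simply `Range[200]`, each value equals its own position, which is why deleting by position removes exactly the sampled rows. A shorter alternative that avoids the list of lists would be `Complement[allRows, randomRows]`.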
I'm not done yet, because if I want to use a classifier measurement to test the accuracy and other properties, this test data set has to be in a different format than this. This is the format; I'm just going to execute this so that you can see. It's got to be a list of lists, and it's got to be in a format where, for every row of those 40, I have my eight predictor values, the arrow, and then the outcome. Eight predictor values, arrow, and the outcome. The way I do that is just to create a little table, and inside of that I create a list which iterates through every column of every row. So column one, two, three, four, five, up to eight columns, the arrow, and then the outcome, which was column number nine. And n goes from 1 to 40, so I go through all 40 of that set. So it's got to be in this format.

So let's look at the random forest. I'm going to use a computer variable called random forest model. It's going to be a classification problem. I'm setting this up as a classification problem, not as a regression problem; I just want to have an outcome of zero or one, where zero does not have appendicitis and one does. I'm going to pass my training data set, but I've got to tell Mathematica which one of those columns is the outcome. So the histo column, the last column, nine, is the outcome, and the method that I'm going to use is random forest. Let's execute that, and I now have a classifier function object here.
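A minimal sketch of these two steps, under a few assumptions: the data is again a synthetic stand-in so the snippet runs without the CSV, the variable names are my own, and the training set is also converted to rules rather than passed with an output-column specification as in the video. `Classify` accepts a list of `input -> output` rules, so this is one simple way to express the same idea.

```wolfram
(* Synthetic stand-in data and split; not the video's file *)
SeedRandom[1];
data = Table[Append[RandomReal[1, 8], RandomInteger[1]], {200}];
randomRows = RandomSample[Range[200], 160];
dataTrain = data[[randomRows]];
dataTest = data[[Complement[Range[200], randomRows]]];

(* Each test row becomes {p1, ..., p8} -> outcome *)
dataTestModel = Table[
   dataTest[[n, 1 ;; 8]] -> dataTest[[n, 9]],
   {n, Length[dataTest]}];

(* Same rule format for the training rows (a choice made for this sketch) *)
dataTrainModel = Table[
   dataTrain[[n, 1 ;; 8]] -> dataTrain[[n, 9]],
   {n, Length[dataTrain]}];

(* Train a random forest classifier; the result is a ClassifierFunction *)
randomForestModel = Classify[dataTrainModel, Method -> "RandomForest"]
```

The classifier can then be applied to a single row of eight predictor values, for example `randomForestModel[dataTest[[1, 1 ;; 8]]]`, which returns the predicted class.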
To use it, though, I need a classifier measurement. I'm going to call that RF model CM, for classifier measurement of my RF model, and as arguments I pass the model that I've just created and then this data test model. That is why I had to get the test set into this format. So I pass that as an argument, and I now have a classifier measurement object. Let's look at the properties that exist for this: I can look at accuracy, accuracy rejection plot, class rejection plot. First of all, what we want is the accuracy, and I get an accuracy of 92.5%, so very good.

Let's have a look at where it went wrong, and this is the confusion matrix plot. There we go, so my RF model CM, my classifier measurement, let's have a look. Down here is my predicted class and on the left is my actual class, with zero at the top and one at the bottom. Where the actual class was one and the predicted class was zero, I have zero here, so I have none of those cases. If it predicted a one, the actual outcome was one in 17. If I predicted a zero, 20 of those were indeed zero. The remaining three were predicted incorrectly: when we predicted a one, three of those were actually a zero. So we have the confusion matrix there.

I just want to show you this little Manipulate function, with which I can go through all of the properties in a little table. Just remember always to set SaveDefinitions to True so that it can get executed every time. It's quite a neat and handy little Manipulate function to write here, because it makes life easier if you just want to iterate through all of those. We can see all of the properties here. There was my accuracy, and if I go to the confusion matrix plot, we see it is there. So I can just click on any of those through this little Manipulate command. Let's do one more: a logistic regression
model. I'm going to call it LR model. It is Classify again with my training data set, telling Mathematica that the histo column is the outcome, and this time I'm going to use logistic regression. Same story: I'm going to do a classifier measurement and pass the model that I've just created and the data test model. Remember how we modified that. Now I have a classifier measurement object, and once again I can iterate through all of the properties. Now I see I've got an accuracy with my logistic regression model of 97.5%. That's indeed very good. Let's look at the confusion matrix, and look at that, there was only one mistake made: the predicted class was that the patient had appendicitis, and actually the patient didn't. In 18 of these it was predicted that the patient has appendicitis, and indeed the patient did; in 22, patients were predicted not to have appendicitis and the actual class was not appendicitis.

Let's have a look at one more very quickly, a support vector machine model. Exactly the same thing is going to happen here: first just a classification and then a classifier measurement, passing the same arguments. I'm using this neat little Manipulate, and we see there was an accuracy of 100%, so no mistakes were made in this model.

One last thing I wanted to show you is some feature extraction. I'm going to create this computer variable called data feature, and it is all the rows and eight columns. All I'm cutting out here is that last outcome column. I don't want that outcome column; I only want all the predictor columns. We're going to create a computer variable called FE, and that's my feature extraction. I use the feature extraction function and I pass this data feature, in other words only my eight columns, my eight predictor variables. So I'm going to create a new random forest model here. I'm going to call it random forest model FE. It is Classify again, passing the training data set, telling Mathematica that histology is indeed
my outcome. The method is random forest, but I'm using a feature extractor, so that's one more argument that I can pass, and the feature extractor is this FE that we've just created here. If we now create a classifier measurement and look at the accuracy, we see 92.5, and if we go back to our earlier random forest model, the accuracy there was 92.5 as well. So in this instance it didn't really improve anything. So that's a quick look at how easy it is to do machine learning with Mathematica.
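To pull the measurement, comparison, and feature extraction steps together, here is a rough sketch. It runs on synthetic stand-in data (random predictors and random 0/1 outcomes), so the accuracies it prints will not match the 92.5%, 97.5% and 100% seen in the video. The variable names and the `toRules` helper are hypothetical, and the training data is given as rules rather than with an output-column specification.

```wolfram
(* Synthetic stand-in data and an 80/20 split; not the video's file *)
SeedRandom[1];
data = Table[Append[RandomReal[1, 8], RandomInteger[1]], {200}];
trainRows = RandomSample[Range[200], 160];
testRows = Complement[Range[200], trainRows];

(* Hypothetical helper: rows -> {predictors -> outcome, ...} *)
toRules[rows_] := Table[data[[n, 1 ;; 8]] -> data[[n, 9]], {n, rows}];
trainRules = toRules[trainRows];
testRules = toRules[testRows];

(* Train the three methods from the video and measure each on the test set *)
methods = {"RandomForest", "LogisticRegression", "SupportVectorMachine"};
models = AssociationMap[Classify[trainRules, Method -> #] &, methods];
measurements = ClassifierMeasurements[#, testRules] & /@ models;

#["Accuracy"] & /@ measurements                 (* accuracy per method *)
measurements["RandomForest"]["ConfusionMatrixPlot"]

(* Browse a few properties interactively; SaveDefinitions keeps the
   definitions with the Manipulate so it still works after reopening *)
Manipulate[
 measurements["RandomForest"][property],
 {property, {"Accuracy", "ConfusionMatrixPlot"}},
 SaveDefinitions -> True]

(* Feature extraction on the eight predictor columns only *)
dataFeature = data[[All, 1 ;; 8]];
fe = FeatureExtraction[dataFeature];
randomForestModelFE = Classify[trainRules,
   Method -> "RandomForest", FeatureExtractor -> fe];
ClassifierMeasurements[randomForestModelFE, testRules, "Accuracy"]
```

As in the video, the feature extractor here is built from all 200 rows of predictors, test rows included; building it from the training rows only would be the stricter choice.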