 This video is a small introduction to Orange. Now Orange is a piece of software that allows for graphical computing. So instead of writing code you just drag these nodes onto a canvas and behind the scene that generates some Python code and that does all your data signs for you. So in this in this video I'm going to show you how to do a small linear regression model and compare that to a random forest model just on some test data. Now in my unit of course we'll use Julia, Python, R, the Warframe language but sometimes we just want to get things going quickly and just have a look at a quick model and for that we use Orange. Now Orange produces wonderful output wonderful plots and graphs so sometimes that's quite enough to do but sometimes it's just something quick and easy just to get going. So in this video I'm going to show you how to create those two models so we're going to ingest the data we're going to do the normal descriptive statistics and a bit of data visualization we're going to create the models and then we're going to compare them as far as the receive operating characteristic curve or the area under the curve then and and scatter plotters to see the two classes that we're going to deal with as well as a confusion matrix so we can just compare these models and it's really very easy we're just going to drag some icons onto a canvas and that's going to do all the computation for us. I've opened my browser and I've navigated to orange.biolab.si and this is where you can download Orange. Of course if you have Anaconda installed you can use it right from Anaconda but if you want a separate download as easy as clicking on download and just following the instructions they've got a lovely blog that is quite up kept quite up to date and there's some lovely tutorials there or at least reference to tutorials and they've also got a youtube channel so I've downloaded and installed orange and this is what it looks like when you open it and you can see the video tutorials there on youtube get started some examples documentation and all we're going to do is just to create a new blank worksheet there and on the left hand side you'll see there's a couple of tabs there's a data tab and it shows you all these nodes that you can drag onto the canvas and some visualization nodes you see their bar plot box plot and that's how everything works you just drag these icons to the screen to your blank canvas here and you do some data analysis so the first thing we're going to do is just to import a CSV file so right under data there's CSV now what I could also do is just double tap double click or right click I should say that's two fingers on my Mac here uh click with two fingers but right mouse button and you can just start typing so it is CSV file import that is going to be right at the bottom but all these icons on the left hand side these these nodes they're all in this little search box and the one we're going to go for it's a CSV file and there it is and we're going to place it here on the left hand side you can place it anywhere but we're going to build the tree structure towards the right as we do our analysis so let's double click on that node and we can import a file so I'm going to click there on browse my file system and I've navigated here to the file that I'm going to use and that's it it's a common separated value file now the CSV file import gives you some options here in case your separation wasn't just the commerce but it might have been tabs etc and you can see down here a very quick just view of the first couple of rows of data there's gender height and weight and of course what we're going to do we're going to predict gender and this is a binary classification of gender just to keep things for the sake of keeping it simple and then there's height which is numerical continuous numerical variable here and weight which is also continuous numerical I don't know who measures height and weight to these extremes but once again this is oversimplification just for us to have to have something to show so we're going to click okay and then that's going to import it says 10,000 rows were imported three features and there was no metadata there again you can just go back to the import options there's also the help there and that's it it's it's done and we can just close that window so there we go we can also just have done it with file and just import a file same story but then you don't have such absolute control over it but what you can do here just to let you know there's some information there for you is that you can do a URL as well so if your CSV file is on the web you can just go there directly but we've used the CSV import here so let's click on that file and just hit this the delete key and it is gone now the first thing that we always want to do is just to view the data we saw a couple of rows there so what I'm going to do I'm going to hover you on the right hand side you see there's a bunch of little marks there I'm going to click hold and drag out and there we go now this selection pane opens again that have got all this all the nodes there so I'm going to type data table there it is on the second one there so I'll just click it drag it down a little bit and if I click on data table it's actually going to show me this whole CSV file of mine and you can see we've also visualized the numeric values so what that does it draws these little bars underneath so you can see the bar 473 there's longer than the bar 468 or the bar for 241 so this is a bit of visualization um color by instance class that's not going to nothing for us now because we've got all mail at the top but anyway we see 10 000 instances no missing data three features no target variable and no meta attributes so definitely we'll have to do something to select one of these columns is our target variable and the other thing as our predictor variables so let's just close this and first things first well that was first thing just to view our data see that it's imported nicely but twice would just be some descriptive statistics so I'm just going to drag out again just put it right at the top and it's called feature statistics now that's a bit difficult because in the beginning you're not going to know the names of these nodes but it doesn't take too long and then you know so feature statistics double click on that and there we go and you can say color by and color by of course you're going to choose a categorical variable so gender it is and we see weight there height and you see little histograms there after this distribution of weight for male and female because that's the two that we get and for height we see and then this is a little bar chart we can see how many males and females there were and then we see a bit of descriptive statistics the center would be the mean the dispersion the minimum and maximum values there and then the percentage missing the total and the percentage missing so that's a just a nice quick little demonstration there of of of our descriptive statistics and that makes it very easy of course once we have our descriptive statistics we might want to just create a couple of just data visualization so what about just creating let's just create a box plot so let's do box plot there we go now I can rename this so let's rename this I'm right clicking on the node itself and I'm going to say rename and let's say we're going to make this height by gender so that's the aim at least of my box and whisker plot here so let's double click on it and what we want to do well it seems just to have gotten everything that we that we do want so we want this one by height so that's a little change we had to make so we want this by height and the subgroups that we want gender and that's the only category available that I could find so it's selected that one by itself and now we see a box and whisker plot these are horizontal box and whisker plots we see the central box there and we see the whiskers and down in this instance to the maximum minimum and we see see here that we've we're comparing by means and I can also compare by medians if I compare by means it's actually going to do this t-test for me and it sees a t statistics the statistic there of 95 and that's a highly significant you see the p-value there less than 0.0001 at least and you can see quite clearly there the the distinction between these two now what you can also do is generate a little report on this so if you hit a little report there you can save that report and that's going to save an html file which you can view later and you can of course just save this plot if you click on that it's going to save the image for you and you can as a png as you can see there but you can also do svg and pdf and you save this image and it's beautiful images beautiful plots for for your report so let's just let's just do this again so I'm going to just right click there and let's type in box this time so I put my box plot there now it's not connected because I just created something blank on the blank canvas but I can once again go to the right hand side here and just draw out and connect it there to the left hand side right click on it I'm going to say rename and let's make this weight by gender weight by gender I'm going to call it that's just a description there for for myself so I can remember when I view this the scan this again what I was what I was going to do double tap on that and we want the weight that is selected here and you can see there because we compare by means since t-test at t statistic again 131 so a big difference between between male and female in this little toy data set so that would be it as far as descriptive statistics is concerned we've got our mean and our dispersion or standard deviation minimum maximum etc and we've got these two beautiful plots as far as our two feature variables are concerned so let's have a look at what do we want to do now we want to build a model so what we can do is let's just for let's just tap here on this t I'm going to hold it down a little bit and let's choose 24 pixel I want something big on the t there's obviously some text and let's click there and let's type in my models I'm going to create some models down here so you can do that just click off of it and there we go you can drag it around and of course you can put in these little colored arrows if you want it to as well so that makes it nice just to to to generate a nice looking canvas here so the first thing we want to do we've got to tell we've got to tell orange here what our feature variable is and what our predictive variables are so what I'm going to do I'm going to drag out once again and let's put it way down here and what I'm looking for is select columns once again as I say in the beginning you're not going to really know what all of these nodes do just takes a bit of time or at least you can you can read on that so let's double click on that and it's actually done a good job here I think it's because I had done this before remember once again you can generate a report etc so for my feature columns let's let's select them and put them back let's select them all and put them back so you can just show you if in case you run into this and it looks like this so I have feature columns up here and I've target ones down here so I'm gonna I'm going to either just drag height and weight into my feature columns or I can select it and then just click those little arrows there and I can also switch those two around and I don't have to do anything because this little mark is checked by default if it's not then you can say send selection but if it's done if it's clicked there it's automatically going to do what you want it to do without without doing much so now orange knows you know what this is all about what our feature variables are and what our predictive variables are of course now we've got to do some training test split so let's do that I'm going to drag out again and what I'm looking for now is data sampler data sampler and it's right there at the top so let's double click on that this is bring it in and what I want to do is some cross validation where you could go for that some bootstrapping if you want to do that but let's try to go for fixed proportion it's going to take 70% of our data and that's going to go to the training set and then 30% it's going to to take as our test set so let's just say sample the data and then it's done and you see out of the 10 000 there's 7 000 selected as our as our training set so let's just close up here and now the next thing we want to do is we want to create a logistic regression model so let's drag it out let's drag it down here and we're going to look for logistic regression there we go logistic regression of course if you click down here there's models you'll see all the different types of models there are support vector machines random forests naive bays adaboost logistic regression so let's start with that logistic regression we can double click on that bring it in and you can give it a name there so if you want a different name you can see from weak to strong what you want your c variable to be and you can choose lasso or rich regression let's just keep it at l l1 regression at the moment so nothing fancy but that's where you would set your parameters for your logistic regression and what we want to do from there is let's let's click on here evaluate because what we want to do is some predictions so let's bring it in there that is going to be our predictions now what we want to bring into the predictions of course is two things one is our logistic regression model which was trained on the 70 data and the remaining 30 data we want to bring in as well so that we can see how well this is doing so i'm going to connect the logistic regression and i'm going to select the data sampler please remember to do that if you're doing these predictions so what you want is the 30 that was remaining connect that to the predictions node and your model to that as well so if we double click on that we're going to see what is happening to each and every one of our instances so it's going to go from row one it had a split there and it's of 0.21 as far as our logistic regression model is concerned 21% probability there that this is female and then overwhelming that it's a male that it's female i should say so that's what it predicted so it's just going to show you this this thing and what it does at the bottom also gives you area under the a curve there for the receiver operating characteristic curve and that's 97 not too bad you see it your CA your F1 scores and then your precision and recall then 92% 92% percent now it's not too nice to have it like that there's a couple of things we can do let's create number one to visualize this let's create a scatter plot it's right at the top for me there and if i double click on the scatter plot there you can see and you can see it's colored for us the regions let's just put that off and you can put different things on the x and the y axis i think it would be better for me to have weight on the x axis and then height on the y axis that just makes a little bit more sense you can see female and male there a male color red female color blue and you can see the predictions you can see the predictions there playing around with these as i said there's your color regions you can put in you can put in some grid lines etc show regression lines as well so and once again you can save this as a as a report and to you can export it as an image so fantastic there one more thing that you might want to do is just to um to have a confusion matrix so let's take this confusion matrix here let's drag that out and that's a fantastic tool just to to evaluate your model so let's double click on that and there's our model and we can select here what we want the output to be but we can also then see what we want to show so let's show the number of instances and so 3,160 in my instance here was female and a predicted female so there's actual here on the as far as the rows are concerned predicted as far as the columns are concerned and then we see three three thousand two hundred seventy eight were male and they were predicted as male and then what we could also do is just do that as a as a proportion so 92 92% and then 8% 8% so that works very well as far as looking at how well your model is doing lastly let's just do an ROC curve that's always good so let's draw let's just get the ROC curve there there we go double tap on that and there we see our ROC curve and you can do all sort of selections here as far as you can see and at the moment the target is male let's change that to female depending on that type of model you'll build of course you'll get different results so that's fantastic but that's let's not stop there let's go back to models let's just compare our logistic regression model to something else watch you choose random forest at least that's what's the plan that I had so let's just draw from the data sampler now to to the predictions well that's already done so what we need to do here is from the data sampler we're going to draw to a random forest so 70% of the data is now going to go to random forest and remember then we're going to bring it in as well to predictions let's just drag it into predictions there there we go so let's double tap on the random forest and you've got a couple of selections there you can select the number of trees I'm just going to leave it at the default and you can number of attributes considered each split replicable a training etc do not splits subset smaller than five you can also limit the depth of your individual trees and once again it's saying there that it's going to take seven thousand of of our data values there so if I look at predictions now I can see there's a logistic regression column and a random forest column and you can see the differences there if you were so inclined to run down all of those but look at this the AUC of my random forest was 99.8 just at the default so that's fantastic it's doing much better in this toy little data set at least than the logistic regression and because we've just drawn it in it'll do all of these for you so in the scatter plot that's what we have at the moment and we can choose the logistic regression model there or we can choose random forest there might not be too interesting let's look at the confusion matrix and now we can compare this random forest and logistic regression models as far as our confusion matrix is concerned and then also the ROC curve and you can see how much better our random forest is doing so let's just close that and that's it that's our simple advice to build two models and to compare these two models and say all of these if I double tap on that I can export that as a report and that'll be an HTML report have a look at that just save it then to to your system and when it comes at least to these lovely plots that we've got here I really do like the way that these are ended and then of course you can just click right at the bottom there and save that image so remember this workflow it's a very easy workflow once you get used to it and all you have to do is start looking at these nodes and just start playing with them and you very quickly learn how to put them together and there's of course a lot of things that you can do as far as changing the data as it comes in the strain test split a lot of the nitty-gritty or very quick very quick workflow that you want in the beginning or you just sort of want to just see where these models are going before perhaps you want to want to go into the notebook or use our studio or whatever to get to get some better results or delve into things a bit deeper this is really fantastic and for many things it's the only thing you need to do is just to use orange and just build your little tree structure here and save all your reporting so that's it for this video I hope you enjoyed that if you want me to make some more videos just here on using orange this is that for me really is something that we'll we always start off with when we build our models just to see where things are going and depending on the project this might be this might be the only thing that that we do so have a look have a look at their website look at some other tutorials look at their lovely very good youtube videos you'll learn a lot from that and give them a bit of support and spread the news of this tool that's really existed for quite some time it's sitting nothing great new and I'm not sure why I haven't made a video about it yet fortunately I've found some time just away from the hospital at the moment after a long day in theater just to do this quick little video so I hope you enjoyed that give it a thumbs up subscribe comment all those good things subscriptions always help and it motivates me to to keep making these videos so thanks for watching go use orange and build some fantastic models