In our previous video, we learned to estimate the accuracy of machine learning models on an independent test data set, that is, without reusing the data we used to train the prediction model. Our workflow looked something like this: first we load the iris data set, sample part of it for training and use it to construct a classification tree. Then we pass this model on to the Predictions widget to classify the remaining data, and finally we can double-check the results in the Distributions widget.

Moving on, I'll arrange the Data Sampler and Distributions widgets side by side, like this. I'll also turn off replicable sampling so that I get a different sample every time I resample the data. With different samples, the model receives different training data and will probably make slightly different predictions each time. I also expect the classification accuracy to change from sample to sample, since I'll be testing on a different test set each time. Let's give this a shot. I can keep clicking Sample Data, and I get different errors every time; if I'm really lucky, I might even find a sample with no errors at all. The easiest way to read off the computed classification accuracy is the CA column in the Predictions widget. Notice that with each new sample the classification accuracy changes. The values mostly stay in the 0.8 to 0.9 range, but even on such a simple data set the differences are noticeable.

Remember, we're doing all of this because we're trying to estimate the accuracy of our machine learning technique. Since we can't compute the accuracy on our training data, we have to split the original data set, but then the result still depends on the specific split. So what number can we report? We can't just pick the lowest or the highest; the standard procedure is to report the average accuracy obtained over, say, 100 tries. That would mean pressing Sample Data 100 times, writing down the result each time and computing the average at the end. Of course, there's no way I'm doing this by hand.

Fortunately, there's a widget for this in Orange, called Test and Score. On its input, it receives our entire data set. For scoring, we'll use random sampling: we'll repeatedly train and test 100 times and have the widget use 70% of the data for training in each iteration. Test and Score does everything for us. It splits the data, trains the model on the training part, tests it on the test part, computes the accuracy, repeats this 100 times and finally reports the average classification accuracy. For now, though, the widget is still empty, since we've only given it data. We also need to give it a modeling technique. In Orange, we call these modeling techniques learners: a learner is an object that can be given a data set and will output a classifier. So let's place another Tree widget on the canvas and connect it to Test and Score. You can see the connection now uses the Learner channel, whereas before we were sending an already trained model to make predictions. Test and Score has now done its job: it ran 100 iterations and reports a 94.4% classification accuracy for our modeling technique.
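To make the procedure concrete, here is a minimal sketch of what this repeated random sampling amounts to, written in plain Python with scikit-learn as an illustrative stand-in rather than Orange's own API. The data set, the 70/30 split and the 100 repetitions mirror the workflow above; the exact average you get will vary from run to run.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

accuracies = []
for _ in range(100):
    # a fresh random 70/30 split on every iteration
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7)
    # train the tree on the training part only
    tree = DecisionTreeClassifier().fit(X_train, y_train)
    # score it on the held-out test part
    accuracies.append(accuracy_score(y_test, tree.predict(X_test)))

# individual accuracies vary from split to split;
# the number we report is their average
print(f"average classification accuracy: {np.mean(accuracies):.3f}")
```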
Instead of the random sampling we've used so far, a more common procedure for accuracy estimation is called cross-validation. Here the data is split into a number of folds, say 10, and in every iteration one fold is reserved for testing while the rest are used for training. The advantage is that all of the data gets used for testing, and each individual instance is tested exactly once. Let's give this a try in our workflow. In the Test and Score widget, I select cross validation and set the number of folds to 10. The classification accuracy doesn't change much; it's now at 95.3%. Now let's take a closer look at the errors our model made with the Confusion Matrix widget. There are seven mistakes in total, all of them versicolors and virginicas misclassified as each other.

That's it for this video. We talked about accuracy estimation for classification algorithms and learned to use the Test and Score widget to evaluate models in Orange. Next time, we'll take a closer look at some of the other scores we can find in this widget, as well as how to use it to compare different models.
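If you'd like to see roughly what this last part corresponds to in code, here is a similarly minimal sketch of 10-fold cross-validation and a confusion matrix, again using scikit-learn purely as an illustration of what the Test and Score and Confusion Matrix widgets compute; the accuracy and the exact error counts will depend on the tree settings and on how the folds happen to be assigned.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier()

# 10-fold cross-validation: each fold is held out for testing exactly once,
# so every instance receives exactly one out-of-fold prediction
folds = StratifiedKFold(n_splits=10, shuffle=True)
predictions = cross_val_predict(tree, X, y, cv=folds)

print("classification accuracy:", accuracy_score(y, predictions))
# rows are the true classes, columns the predicted ones; off-diagonal cells
# are the mistakes (on iris these are typically versicolor/virginica mix-ups)
print(confusion_matrix(y, predictions))
```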