In my last video, I shuffled my classification data and still managed to get excellent results from my classifier. So, here's what I did. First, I loaded the iris dataset and randomized the class labels. Second, I quickly plotted the data just to make sure it looked messy enough. Third, I used this data, let's call it our training data, to develop a classification tree. And finally, I fed that same data to the Predictions widget and took a look at the results with the Distributions widget.

Now, these mostly accurate results hint that machine learning works like magic and can devise great models even if the input data is messy and the class labels are in chaos. What I did is of course wrong: I made the predictions on the same data I used for training. To understand why this is a huge problem, let's take a look at the tree I've created. Remember, the one I trained on the original iris dataset was small and simple. Opening the Tree Viewer now, I see this monstrosity. Wow, 97 nodes and 49 leaves. This tree is huge, especially considering the data consists of only 150 instances. Also, as you can see as I scroll back and forth, most of the leaves contain just a couple of instances. On this randomized data, the tree cuts the feature space into small pieces containing only a few instances, usually of the same class. In doing this, it remembers all the data from the training set. So, when we use such a classification tree to make predictions on that same training data, it performs well simply because it has seen the data already.

Now, I can push this idea even further by removing any forward pruning. Remember, I discussed forward pruning in my last video. This way, the tree gets even bigger: 163 nodes and 82 leaves. In the Distributions widget, I can now see that nearly all the predictions on the training data are correct; there are only two misclassifications. Using these results to measure the accuracy of a model would, however, be way too optimistic, as it favors models that remember everything and completely overfit the data.

Obviously, I should be testing my models on separate datasets, not the ones I use for training. So, let's first try this out on my randomized data. With the correct testing procedure, I expect the accuracy of my model to be very poor. Here we go. I'll remove the scatter plot from my workflow first and replace it with the Data Sampler. By default, I should get 70% of the data on its output. I'll feed this sample to the Tree widget, turn pruning back on, and then test the resulting model on the remaining 30% of my data. In order to do this, I'll have to link the remaining data output from the sampler to the Predictions widget. So, this is my new workflow. Remember, I'm training the tree on the training data comprising 70% of my initial dataset, then testing it on the remaining 30%.

This time, the Distributions widget shows the tree made a lot of errors, just like I expected. Bad data with randomized class labels means the tree is basically making random predictions. But notice that I only observed this poor performance because I estimated the errors on a separate test set that was not included in the training data. Now, I can make one more modification to my workflow: I'll remove the randomization to estimate the accuracy of a classification tree on the original iris dataset. So, again, I'll assess the accuracy on a separate test set.
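If you'd rather reproduce the flawed procedure in a script than on the Orange canvas, here's a minimal sketch in Python using scikit-learn instead of the widgets from the video; the seed is arbitrary and the node count will differ from the one I got in Orange, but the point stands. An unpruned tree trained on randomized labels and scored on the very same data looks nearly perfect.

```python
# A sketch of the flawed evaluation: shuffle the iris class labels, grow an
# unpruned tree, then "test" it on the very data it was trained on.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

rng = np.random.default_rng(42)   # seed chosen arbitrarily
y_random = rng.permutation(y)     # class labels are now pure noise

# No pruning constraints: the tree is free to memorize every instance.
tree = DecisionTreeClassifier()
tree.fit(X, y_random)

print("nodes:", tree.tree_.node_count)
# Near 1.0 despite the labels being random -- the tree has simply
# remembered the training data.
print("training accuracy:", tree.score(X, y_random))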
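And here's the corrected procedure on the same randomized labels, again as a scikit-learn sketch standing in for the Data Sampler widget: hold out 30% of the data and score the tree only on that held-out part. With three balanced classes and noise labels, accuracy should collapse to roughly chance, about one third.

```python
# The corrected evaluation: a 70/30 split, mirroring the Data Sampler.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)
y_random = rng.permutation(y)     # randomized labels, as before

# Train on 70% of the data, test on the remaining 30% the tree never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_random, train_size=0.7, random_state=42)

tree = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))  # roughly chance, ~0.33
```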
Now, removing the Randomize widget and connecting the data directly to the Data Sampler, let's take another look at our results in Distributions. Okay, this time the tree made only two mistakes, confusing virginicas and versicolors. These misclassifications are presented clearly in a confusion matrix, which I'll add to the output of the Predictions widget. Note the two off-diagonal entries, which highlight our misclassifications.

Now, the main takeaway of this video is that you should always estimate accuracy on fresh data and never do so on the data used to train a model. But one problem still needs to be solved. Especially with smaller datasets, accuracy estimates can depend heavily on the randomness of the data split. So, we'll have to figure out which samplings to consider and which estimates to report. But we'll talk about that in our next video.
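For completeness, here is the final workflow as the same kind of scikit-learn sketch: original iris labels, a 70/30 split, and a confusion matrix whose off-diagonal entries count the misclassifications. The exact errors depend on the random split, so you won't necessarily see exactly two, which is precisely the instability the next video will address.

```python
# The final evaluation: original labels, 70/30 split, confusion matrix.
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=42)

tree = DecisionTreeClassifier().fit(X_train, y_train)
y_pred = tree.predict(X_test)

print("test accuracy:", tree.score(X_test, y_test))
# Rows are true classes, columns are predictions; off-diagonal entries
# are the misclassifications, typically a few versicolor/virginica mix-ups.
print(confusion_matrix(y_test, y_pred))
```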