In our previous videos, we learned about classification, classification trees, and the estimation of prediction accuracy using cross-validation. Classification trees have an advantage over some other classification methods: we can view and interpret the trees and thus gain an understanding of the underlying data. But there is one problem. Let's start with another data set, employee attrition. This data set reports whether employees left their jobs and contains further information on their age, department, level of education, hourly rate, and so on. This time, I'll sample the data, then build the tree and view the resulting model in the Tree Viewer. Wow, the tree really is enormous. It would take substantial effort to interpret anything from here. So let us instead check the tree's structure and visualize it in another viewer called the Pythagorean tree. Here the tree root is at the bottom, and the data is first split by overtime, for example. The squares in this visualization show the amount of data in each node.

Let's place the Data Sampler and Pythagorean Tree widgets side by side. I'll also uncheck replicable sampling and then resample the data. Now you can see how the structure of the tree changes. As it turns out, every time I sample the data, the resulting tree is completely different. If we keep trying on more and more samples, it becomes apparent that the trees are not very stable and that the result depends heavily on the composition of the input data. Even a slight difference in the data changes the tree structure entirely. This is because every splitting node depends on the selection of just one feature. When several features provide an almost equal amount of information, small changes in the data will influence which of them is picked, and splitting on a different feature means the entire subtree from that node onwards will also change. We can easily interpret trees, but that doesn't help us much if they're so different from one sample to the next.

Building so many different trees may still have an advantage, though. Each tree, built from a different data sample, sees the data from another angle, just as people have slightly different views on various topics. Take, for example, the show Who Wants to Be a Millionaire. There, every contestant has the option to ask the audience for help with one multiple-choice question, and then make an informed decision based on the distribution of the audience's answers. It turns out that picking the most popular choice is often their best shot. We can apply the same principle to our classification trees and build a classification forest: instead of developing just one tree, we sample the data, build a forest of trees, and have them vote. This ensemble of classification trees is what we call a random forest.

Let's build one from our attrition data, then visualize the result in the Pythagorean Forest widget. Great, 10 trees were built, and they each look distinct. Random forest intentionally randomizes the trees, not only by sampling the data, but also by choosing the feature to split on at each node at random from among the most suitable candidates. Classification by random forest uses a voting system: each tree votes by predicting the class, and the output class is the one that receives the most votes. So let's see how well this method performs compared to a single tree.
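The video does all of this with widgets, but the same two ideas, tree instability under resampling and forest-style voting, can be sketched in a few lines of code. The snippet below is a minimal illustration assuming scikit-learn and a synthetic stand-in for the attrition data; the dataset, sample sizes, and feature count are my own placeholders, not taken from the video.

```python
# Sketch of tree instability under resampling, plus majority voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 1000 "employees", 10 features, a binary attrition label.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

rng = np.random.default_rng(0)
votes = []
for i in range(10):
    # Resample the rows, much like resampling with the Data Sampler widget.
    idx = rng.choice(len(X), size=500, replace=True)
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    # The feature chosen at the root often differs between samples, and with it
    # the whole subtree below it -- the instability seen in the Pythagorean tree.
    print(f"sample {i}: root splits on feature {tree.tree_.feature[0]}")
    votes.append(tree.predict(X))

# Forest-style voting: the predicted class is the one most trees agree on.
votes = np.array(votes)                           # shape: (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("accuracy of the vote over all rows (illustration only):", (majority == y).mean())
```

Running this, the root feature typically changes from sample to sample even though the underlying data-generating process is identical, which is exactly why a single tree is hard to trust on its own.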
First, I'll clear out our workflow, leaving just the data loading widget. Then let's estimate the accuracy of our learners through tenfold cross-validation. There we go. Now we'll compare trees to random forests. And wow, the forests really do perform substantially better: the classification accuracy goes from 0.81 for trees to 0.85 for forests, and other scoring methods like AUC report even bigger differences. I'll go further into depth about what exactly AUC is in one of our following videos.

One of the parameters of a random forest is how many trees it contains, so let's raise this to, say, 500. The accuracy again increases slightly, and AUC gets an even more significant lift. So random forests are great classifiers if we want increased accuracy. The problem they face is with interpretation: simple models tend to have lower accuracy but are often very easy to interpret, while complex machine learning models that make the most accurate predictions can be very hard or even impossible to interpret. I'll go further into this topic in our upcoming videos, where I'll introduce some other classifiers and ways to interpret them.
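To close, here is a rough sketch of the tree-versus-forest comparison above, again assuming scikit-learn and the synthetic stand-in data rather than the real attrition table, so the scores will not match the 0.81 and 0.85 reported in the video.

```python
# Sketch: 10-fold cross-validation of a single tree vs. random forests.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in data again; in practice, load the real attrition table instead.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

models = {
    "single tree":        DecisionTreeClassifier(),
    "forest (10 trees)":  RandomForestClassifier(n_estimators=10, random_state=0),
    "forest (500 trees)": RandomForestClassifier(n_estimators=500, random_state=0),
}
for name, model in models.items():
    ca = cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    auc = cross_val_score(model, X, y, cv=10, scoring="roc_auc").mean()
    print(f"{name:18s}  CA={ca:.2f}  AUC={auc:.2f}")
```

The pattern, if not the exact numbers, should mirror what the Test and Score widget shows: the forest beats the single tree, and adding more trees nudges the scores up a little further.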