Thank you. Good afternoon. First of all, thanks for the opportunity to speak at this amazing conference. As she said, my name is Verónica Bolón-Canedo. I work at the University of A Coruña, and my research topics are mainly feature selection, big data, and ensemble methods. When I was doing my PhD thesis, which was mainly about feature selection, at some point we realized that sometimes we didn't know which feature selection method to use, because none of them was always the best. So we thought maybe we could combine them, to take advantage of their strengths and overcome their weak points. That is how we started to work on ensembles. Ensembles are now very important in machine learning; in fact, they are the ultimate tool researchers reach for when trying to squeeze out that last bit of improvement in prediction accuracy. So today's talk is about ensemble methods in machine learning.

We will start with a definition. In machine learning, an ensemble is a combination of several models, used instead of a single model, with the aim of improving accuracy or performance. Why do we use ensembles? Well, we use ensembles in our daily lives whenever we want more than one opinion. For example, when somebody has a disease, it is common to ask for a second opinion, even a third opinion, and so on. We prefer to have different opinions because the result is more reliable, and the same happens in machine learning: two heads are better than one. There is also the no free lunch theorem, which says that there is no single universal method that works perfectly in every possible scenario. So it is better to have a combination of them.

Now some motivation. Imagine that you are a researcher, you have very good results, so you decide to write a paper. The paper looks good, so you think, okay, I will submit it to a journal. If it is a good journal, it will go under peer review, which means that your fellow researchers will review the paper. Then after a couple of months, if you are very lucky, or maybe one year, the reviews come out and the editor-in-chief has to make a decision. So imagine that the reviews are in: reviewer one thinks your paper is really great and says strong accept. Reviewer two is not quite as happy, but still says accept; not bad. And reviewer three has doubts: maybe he or she likes the results but doesn't like the writing or something, so reviewer three says it's a borderline paper, "I'm not sure." In this case, the paper will probably be accepted. Imagine now that again reviewer one says strong accept and reviewer two says accept, but this time reviewer three says reject. So we have two good opinions and one bad opinion. Is the paper accepted in this case? I don't know; it depends. And consider that when you submit a review, you can usually indicate your level of confidence. Suppose the first reviewer, who said strong accept, has low confidence: he or she is not an expert on the topic, and maybe doesn't fully understand the paper. Reviewer two, who said accept, also has low confidence, so it's not very reliable. But the third reviewer, who said reject, has high confidence: he or she is an expert on the topic, so maybe that opinion is more important. In this case, the paper is probably rejected. Or the editor can ask for more reviewers, which often happens.
It's the same with our ensembles: we can increase the size of the ensemble so that it is more reliable. Okay, so here we can see the scheme of a typical ensemble for classification. Ensembles were actually born in classification, so that is where they are most often used. We have our training data, and then we have several classification models. These models can be different because they are trained on different data, because the classifiers themselves are different, or because the parameters are different; it doesn't matter how, but they have to be different, because it makes no sense to train the same model over and over. Since the classification models are different, the predictions are also different. If it is a very, very simple problem, maybe the predictions all coincide, but in practice this doesn't happen. So we have different predictions that need to be combined somehow to obtain our final prediction.

Now that you know what an ensemble is, let's play a small game. All of you will be a classification ensemble, and we need to classify these images as labradoodle or fried chicken. Maybe you have seen this before; of course, I distorted the images a little bit. For example, the one in the top left corner is easy: it's a dog, a labradoodle. And this one, can you see the pointer, is chicken. But this one, for example, is not so easy. Here we can see the solution. So now the ensemble, which is you, has been trained, and we need to classify these new images. What do you think? Hands up for labradoodle? Okay. Hands up for fried chicken? Hmm, interesting. I can't say whether the majority of you said labradoodle or fried chicken, but it was a labradoodle. The point is that when we combine multiple opinions, we often get the right result; this one was just a very difficult case. I picked a very, very difficult one.

Okay. So ensembles are widely used and very successful, and in fact they have placed first in many prestigious competitions. For example, in the Netflix Prize competition, which consisted of predicting user ratings for films based on previous ratings, but without other information about the users or the films. We also have the KDD Cup 99 competition, which consisted of detecting attacks and normal connections in a computer network. And I'd also like to highlight this paper, which analyzed hundreds of classifiers to see whether we really need so many of them, and which one was the best. In the case of the Netflix Prize competition, the winner combined multiple individual predictors into a single final solution; this is the definition of an ensemble. In the case of KDD Cup 99, the winner used a mixture of bagging and boosting; bagging and boosting, as we will see, are typical examples of ensembles. And in the paper, the classifier most likely to be the best was random forest, which is also an ensemble method. So ensembles are successful; they work.

When building an ensemble, going back to the scheme I presented before, the most important thing is diversity. We need to induce some diversity into the model, otherwise it won't work. This diversity can come from the training data, because we can take bootstrap samples or select subsets of the data. Or the diversity can come from the classifier level, because, as I said before, we can use different classifiers, or the same classifier with different parameters, et cetera.
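To make that scheme concrete, here is a minimal sketch in Python with scikit-learn; this is not the talk's code, and the synthetic dataset and the three classifiers are my own illustrative choices. Diversity comes from using three different learners on the same data, and their predictions are fused by majority vote.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# A synthetic stand-in dataset (illustrative, not from the talk).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Diversity via three different classifiers trained on the same data.
ensemble = VotingClassifier(
    estimators=[
        ("tree", DecisionTreeClassifier(random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("logreg", LogisticRegression(max_iter=1000)),
    ],
    voting="hard",  # plain majority vote over the three predictions
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```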
If we have different samples of the data but the same classifier, we call it a homogeneous ensemble; if the classifiers themselves differ, a heterogeneous one. And of course we can have a mixture of the two: the more diversity, the more fun. But we need to be careful. This is the fable of the blind men and the elephant: several blind men are trying to figure out what they are touching, and even though each of their individual descriptions was true, it would be better if they came together and discussed their understandings before reaching a final conclusion. The combination step is very tricky and very, very important in an ensemble: if we don't combine the results well, the ensemble is useless.

To combine the different predictions, there are several approaches we can follow. For example, we can select the best model, fuse them using majority voting, or use another classifier; there are multiple options. These are some of the most popular ones. Majority vote everybody knows: it is what we usually do in democracy and in many aspects of our daily lives. We can simply select one model, the most promising one or the one with the highest accuracy. Sometimes a plain majority vote is not enough and we need a weighted majority vote; as in the paper-review case I presented, some opinions are more important than others, so we need to give them more weight. And some classifiers give you not only a prediction but also a probability for that prediction. If we have the probabilities, we can play with them and use decision rules, such as sum, product, et cetera, to combine the results.

Now let's see the classics of ensemble learning: bagging and boosting are the most popular ones. In both bagging and boosting, we have several different training data sets, built by random sampling with replacement. Since the sampling is with replacement, some observations may be repeated. The difference between bagging and boosting is that in bagging, every observation has the same probability of appearing in a new bootstrap sample, while in boosting, the observations have weights, so some of them are more likely to appear in a new training data set. Once we have the several training data sets, the same learner is trained on all of them. Here we can see the difference in the learning stage. In bagging, the training is parallel, because each model is independent of the others: you take a bootstrap sample, you train on it. In boosting, the training is sequential, because it works in the following way: you take a bootstrap sample, you train your classifier, and then the cases that were misclassified have their weights increased, so they are more likely to appear in the next bootstrap sample; in this way we emphasize the most difficult cases. After the ensemble has been trained and new data arrives to be classified, in bagging the new data simply goes to all the individual learners and we take a simple average or majority vote. In boosting, there is a second set of weights, this time according to the performance of each individual model, so we take a weighted average or weighted majority vote.
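A minimal sketch of these two classics, using scikit-learn's stock implementations; the dataset, base learners, and parameters are illustrative assumptions, not the talk's code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging: each tree sees an independent bootstrap sample (uniform
# probabilities), the trees could be trained in parallel, and the
# predictions are combined by a plain vote.
bagging = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=25, random_state=0
)

# Boosting: trees are trained sequentially; misclassified examples get
# higher weights for the next round, and the final prediction is a
# weighted vote according to each tree's performance.
boosting = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=25, random_state=0
)

for name, model in [("bagging", bagging), ("boosting", boosting)]:
    model.fit(X_train, y_train)
    print(name, "accuracy:", model.score(X_test, y_test))
```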
And this is the other very popular ensemble model, random forest, which consists of a set of decision trees. Each decision tree gets its own data, made with a bootstrap sample, and instead of using all the features, each tree randomly selects some of them, to introduce even more randomness. Here we can see an example with five decision trees in the forest, where we are trying to classify objects as blue or black. Imagine that a new object comes in, goes through the trees, and ends up in the circled leaves. If we want the probability of this object being black, we count the occurrences of black in the circled leaves, which is 8, and divide by the total number of occurrences in the circled leaves, which is 11: 8/11, about 0.73. This is how random forest works.

I have only given you the basics about ensemble methods, but if you want to know more, there are many, many nice books; these four are among the most popular ones. My favourite is the last one, Combining Pattern Classifiers by Ludmila Kuncheva. Kuncheva knows a lot about ensembles. I was very lucky to attend a course she taught about ensembles at a summer school; it was really, really good, and when I was preparing this talk I remembered hers and borrowed some of her examples, because they were very interesting.

As I said, ensembles are typically used in classification, but they can be used in other fields of machine learning, with a high rate of success. For example, we can find successful ensembles in clustering, which means finding the groups among the observations; in discretization, when we need to convert continuous values into discrete ones; and when dealing with imbalanced data, where most of the examples belong to a majority class and there is a rare minority class that classifiers sometimes just overlook. Ensembles are also used in quantification. Quantification is similar to classification, but instead of classifying every single example, we just want to know the proportion of examples belonging to a given class. We can also use ensembles with missing data, when some data is lost and we need to do imputation to recover it, and in feature selection, which is when we select the relevant features. You can use ensembles in, I think, every problem: as long as you can introduce some diversity, through the data, the parameters, or different methods, just try it, because you may obtain very good results.

As I said at the beginning, one of my research topics is feature selection, so now I will show you how ensembles are adapted for feature selection. This is the definition of feature selection: the process of selecting the relevant features and discarding the irrelevant and redundant ones, with the objective of improving performance. Sometimes this is surprising: you have less data and your model is better? Yes, because we prefer data quality over data quantity, and when we get rid of the features we don't need, it is easier for the classifier to learn. If you want to know more about feature selection, you can check out my talk from last year here at Big Data Spain. This is the scheme of a feature selection ensemble, very similar to the classification ensemble, but instead of different classification models we have different feature selection models. These feature selection methods each produce a selection of the important features, and those selections need to be combined somehow to obtain our final selection.
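As a small illustration of that scheme, here is a sketch of my own construction (not the talk's code), assuming scikit-learn's standard scorers: several selectors each pick their top-k features, and the final subset keeps the features chosen by a majority of them. The combination options are discussed in more detail next.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, random_state=0)
X = X - X.min()  # chi2 requires non-negative feature values

k = 10
selections = []
for scorer in (f_classif, mutual_info_classif, chi2):
    scores = scorer(X, y)
    scores = scores[0] if isinstance(scores, tuple) else scores
    selections.append(set(np.argsort(scores)[-k:]))  # top-k features

# A fourth, embedded selector: random forest feature importances.
rf = RandomForestClassifier(random_state=0).fit(X, y)
selections.append(set(np.argsort(rf.feature_importances_)[-k:]))

# Majority vote: keep the features chosen by more than half the selectors.
votes = np.zeros(X.shape[1])
for selected in selections:
    for feature in selected:
        votes[feature] += 1
final_subset = np.where(votes > len(selections) / 2)[0]
print("selected features:", final_subset)
```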
Again, diversity is the key, as in any ensemble, and this diversity can come from the training data, because we can take bootstrap samples, select subsets of the data or of the features, et cetera, or from the feature selection methods, because we can use different methods or different parameters for the same method. The combination in this case is still tricky; I would say it is even more important, because feature selection methods can return either a subset of the relevant features or an ordered ranking of all the features, and depending on that output, the combination is different. So the first question is: are we combining subsets or rankings? Then we can again select the best selection, fuse them, use majority voting, use a classifier, et cetera; there are multiple ways to combine different selections.

When combining subsets of features, these are some of the most popular approaches. We can use some kind of majority vote, so the features that appear in most of the selections become part of the final subset. We can take the union of all the selections, but we need to be careful, because we can end up with the whole set of features, which is not good. We can take the intersection, but then we can end up with the empty set, which is not good either. We can also use classification accuracy: you take a selection, compute the classification accuracy, add another selection, compute it again, and see whether it got better or worse. This can work, but we need to be careful because it can overfit the training data; it is like a wrapper in feature selection, very specific to a given classifier, but we can use it. As a less computationally expensive alternative, we can use complexity measures, which give us an idea of the complexity of the data. We prefer less complex data, so we can add or delete features and see whether the complexity increases or decreases.

If we have to combine rankings of features, the combination methods are different. We can use simple operations, such as the minimum: if we have several rankings, we ask, for this feature, what is its position in each of the rankings, and which is the minimum, the best one? That becomes its final position. The same with the mean, taking the mean of all its positions, or the median. In these cases we have to be careful, because there are usually a lot of ties, and ties are not easy to deal with. More sophisticated methods for combining rankings include SVM-Rank, a support vector machine-based method for learning rankings; Stuart, which uses order statistics; and robust rank aggregation, which is similar to Stuart but overcomes some of its limitations.

And maybe you are thinking: does this actually work? Is it worth training several models when we have single methods that work well? Okay, I'll show you with a small toy example. This is my code. I am using a subset of the MNIST dataset; MNIST is a popular dataset for recognizing handwritten digits. So I take this subset of MNIST and split it into training and test. Then I build one decision tree and compute its accuracy, and then I build an ensemble of 15 individual models, each trained on a subsample of 500 examples and 200 features, et cetera. So I build the ensemble, combine the results with a majority vote, and get the accuracy of the ensemble.
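For anyone who wants to try this at home, here is a rough Python analogue of the toy experiment; it is an approximation, not the talk's code. scikit-learn's small digits dataset stands in for the MNIST subset (64 pixels rather than 784, so 16 features are subsampled instead of 200), and the numbers it produces will differ from the ones quoted next.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One decision tree on its own.
single = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("single tree:", single.score(X_test, y_test))

# 15 trees, each trained on a subsample of 500 rows and 16 of the 64
# pixels; scikit-learn fuses them by averaging the trees' predicted
# class probabilities, a soft vote.
ensemble = BaggingClassifier(
    DecisionTreeClassifier(),
    n_estimators=15,
    max_samples=500,
    max_features=16,
    random_state=0,
)
ensemble.fit(X_train, y_train)
print("ensemble:", ensemble.score(X_test, y_test))
```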
So in this case, the accuracy of the single tree was around 51%, and the accuracy of the ensemble was 74%. So yes, ensembles improve on single models. Then I included a feature selection step before the classification step, removing half of the features; the rest is the same code as before, just with this part added. In this case, the accuracy of the single tree with feature selection was 52%, while the accuracy of the ensemble with feature selection was 75%. So again, the ensemble is better than the single tree, and if we compare these numbers with the results on the previous slide, we can see that feature selection helps to improve the performance of the classifiers. And here we can see the evolution as the ensemble size changes from 2 to 50. The blue line is the single tree, which is constant. In red, we have the classification accuracy obtained by the ensemble. And in black: if I have an ensemble of, say, 15 individual models, I compute the accuracy of each of those 15 models and plot the best one. As you can see, the ensemble is much better than even the best of the individual learners, because each learner may be good on some part of the data but not on another; if you combine them, you get better results.

But don't use a sledgehammer to crack a nut. Sometimes there is a simple solution. You have to take into account that building an ensemble is computationally expensive, and sometimes it is not worth it. I would recommend trying other methods before going for the ensemble; if they work, fine, you save your resources. When I attended that course by Kuncheva, at the end she showed us how five different ensemble methods were better than a decision tree, much better, and we were all thinking, oh, ensembles are so good. And then she showed us how a simple k-NN, k-nearest neighbours, was even better, because it depends on the problem. There is no individual method that is the best in every possible problem, and the most famous methods are not always the best. Random forest is quite popular now, but maybe it is not the solution for your particular problem. If you want to know more about feature selection ensembles, this is my new book together with my colleague Amparo Alonso, who is here and gave a talk a couple of hours ago; in it you can find more details about ensembles for feature selection, successful cases, et cetera.

And now maybe you are thinking: okay, but we are at a conference about big data and I haven't said a word about big data. I'll show you now how ensembles and big data are intrinsically related. If you have a small-data problem that you can handle with typical tools on a typical computer, but you decide to try an ensemble, and the ensemble has a large number of individual learners, it becomes a big data problem, because we need to replicate the model many times, and then we can use the big data tools we have available. And on the other side of the coin, if we have a big data problem that we cannot handle with our resources, we can take several samples of the data that can each be handled in a regular way and train an ensemble: divide and conquer. Ensembles can also be used directly on big data problems; in fact, we have ensembles in Apache Spark's MLlib library, and they also appear in scikit-learn, which is used alongside TensorFlow, deep learning, et cetera.
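For completeness, a minimal PySpark sketch of the MLlib side; the data path points at Spark's bundled sample file and is an assumption for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("ensemble-demo").getOrCreate()

# Spark ships a small example file in libsvm format; the path is
# relative to the Spark installation (illustrative, not from the talk).
data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
train, test = data.randomSplit([0.8, 0.2], seed=0)

# A random forest: bagging over decision trees, trained distributedly.
rf = RandomForestClassifier(numTrees=50)
model = rf.fit(train)
predictions = model.transform(test)  # adds a "prediction" column
predictions.select("label", "prediction").show(5)
```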
And if you remember the two cases from the beginning: the Netflix Prize was 100 million ratings that 480,000 users gave to about 17,000 movies, so this is a big data problem. And KDD Cup 99, well, it dates from 1999, but it was 5 million samples, so it is not small, I would say. Both were solved with ensembles. So now my take-home message; well, I have two take-home messages. The first one is that ensembles are here to stay: they are successful and it is likely that they will continue to be. You may think they are a waste of time, but they work, so try them on your problems. I don't know why the slide says one and one, but it's two take-home messages. And don't focus only on classifier ensembles. I showed you how to adapt the ensemble philosophy to feature selection, but it can be adapted to any other machine learning field, so if you have a problem, just give it a try and use an ensemble to see if it improves. And that's it, thank you very much. If you have any questions, I'll be happy to answer them.