Welcome to the dark side of machine learning. In this short presentation, I'll cover the things that most often go wrong when people apply machine learning in the biomedical domain. If you're not already familiar with machine learning, I highly recommend that you go watch my short introduction to the core concepts before continuing.

If you've been following the field, you'll know that there has been an absolute explosion of new methods over the past couple of years, and you're probably asking yourself: do these methods work as claimed? And are the new methods better than the older ones?

The short answer is that many of the methods only show that they are better than nothing. What I mean by that is that they don't compare to any other methods at all. Instead, they benchmark their method against random performance and show that it is indeed better than random. The problem is that being better than random can still mean being completely useless; random is simply not a good baseline to compare to. A complex method can therefore get published despite offering only marginal benefits over much simpler methods, because a proper baseline was never used.

Some methods do compare to other methods, but only show that they are better than bad. In this case, the authors benchmark their method and show an improvement over some existing methods, but the comparisons themselves are flawed. For example, they may compare their performance to the numbers published in the other methods' original papers. However, those numbers were obtained on different benchmark sets, so they are comparing completely incomparable numbers. Alternatively, they may re-benchmark methods but leave out the very best ones, or run other people's methods the wrong way. The end result is that the new method looks good even though it is in fact not the state of the art.

Another common problem in the literature is misleading performance numbers.
Several performance metrics are used to measure how good a method is, two of the most common being precision and recall, the latter also known as sensitivity. Precision is defined as true positives divided by true positives plus false positives, meaning that to get a high precision you need many true positives and few false positives. The catch is that this metric depends on the positive-to-negative ratio of your dataset: if you have a higher ratio, you'll get a higher precision for the exact same method.

This is an issue because in machine learning we often work on artificially balanced datasets with, for example, a one-to-one ratio, meaning one positive example for every negative example. On such a dataset it's not rare to see methods with 98% precision, which sounds great. But a dataset like that might be a thousandfold enriched for positive examples compared to the real-world situation, where we are often looking for the needle in the haystack. A method with 98% precision on a dataset with a one-to-one ratio in fact has only about 5% precision on a dataset with a one-to-a-thousand ratio.

The last issue I want to cover is what I call complete failure; in other words, methods that simply don't work the way they were intended to. The reason is the way performance evaluation is done. Typically the authors have some large dataset, which they randomly split to obtain a training set and the all-important independent test set on which performance is evaluated. However, in biological applications there are two big problems with this. The first is interdependent examples, which were already covered in the core-concepts video. The problem is that you have, for example, homology between examples, making them not independent. Randomly splitting the dataset therefore yields a test set that is not independent from the training set.
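The thousandfold precision correction quoted above is easy to verify yourself. Here is a minimal sketch (my own illustration, not from the talk; the function name is made up) that rescales a measured precision from one positive-to-negative ratio to another, assuming the classifier's recall and false positive rate stay the same:

```python
def rescale_precision(precision: float, old_ratio: float, new_ratio: float) -> float:
    """Rescale a measured precision from one positive:negative ratio to another.

    Precision = TP / (TP + FP). TP scales with the number of positives and
    FP with the number of negatives, so only the class ratio matters,
    assuming recall and false positive rate are unchanged.
    """
    # Express precision as the odds of a positive prediction being correct.
    odds = precision / (1.0 - precision)
    # Changing the class ratio scales these odds proportionally.
    new_odds = odds * (new_ratio / old_ratio)
    return new_odds / (1.0 + new_odds)

# 98% precision on a balanced (1:1) benchmark ...
balanced = 0.98
# ... drops to roughly 5% at a realistic 1:1000 ratio.
realistic = rescale_precision(balanced, old_ratio=1.0, new_ratio=1.0 / 1000.0)
print(f"{realistic:.1%}")  # about 4.7%
```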
Consequently, the authors can overfit their models and still get great performance on the not-so-independent test set, thus not realizing their mistake at all. The other big problem is unknown biases in the data, meaning that the machine learning method can learn the wrong thing from the data. And since random splitting puts the same bias in both the training and the test set, the authors get a method that seemingly works but in reality simply doesn't.

So how can you solve these many issues? Firstly, always compare your method to a baseline method. Do something like a simple logistic regression; that way you'll know whether it is your method that is good, or just the problem you're working on that is easy. Also compare to the best existing methods, if any exist. That means first identifying the state of the art, which you can do by looking at neutral comparisons; these can come either from dedicated benchmark papers or from other methods papers that compare multiple earlier methods. Calculate the actual precision on the real-world problem: figure out the true positive-to-negative ratio in the application of your method, and then calculate what the precision would be at that ratio, not on a balanced dataset. Do redundancy reduction, making sure that you don't have interdependent examples between your test set and your training set. And if at all possible, try to get a truly independent test set from a different source; that will also reveal whether you've overfit on biases in your dataset that hopefully don't exist in the other dataset. Lastly, remember that if something seems too good to be true, it usually is, especially in machine learning.

That's all I have to say about machine learning. If you want to learn more about supervised learning, take a look at this presentation next. Thanks for your attention.
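As a postscript, the redundancy-reduction advice can be made concrete. Below is a minimal sketch (my own illustration, not part of the talk) of splitting at the group level, assuming each example already carries a group label such as a homology-cluster ID, so that no cluster ends up in both the training and the test set:

```python
import random

def group_train_test_split(examples, groups, test_fraction=0.2, seed=42):
    """Split examples so that no group (e.g. a homology cluster) spans
    both the training and the test set.

    `examples` and `groups` are parallel sequences; `groups[i]` is the
    cluster label of `examples[i]`.
    """
    rng = random.Random(seed)
    unique_groups = sorted(set(groups))
    rng.shuffle(unique_groups)
    # Assign whole groups (not individual examples) to the test set.
    n_test = max(1, round(test_fraction * len(unique_groups)))
    test_groups = set(unique_groups[:n_test])
    train, test = [], []
    for example, group in zip(examples, groups):
        (test if group in test_groups else train).append(example)
    return train, test

# Ten sequences in five homology clusters; each cluster stays whole.
examples = [f"seq{i}" for i in range(10)]
clusters = [i // 2 for i in range(10)]  # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
train, test = group_train_test_split(examples, clusters)
```

This mirrors the idea behind scikit-learn's `GroupShuffleSplit`: the split happens between clusters, never within one, so homologous examples can't leak from training into testing.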