Welcome to this short introduction to the core concepts of machine learning. Machine learning goes by many fancy names; you may have run into terms like deep learning or artificial intelligence, but in reality it all comes down to fitting models to data. In this presentation, I'll cover some of the core concepts of machine learning, explain why dataset splitting is important for correctly evaluating the performance of your methods, and discuss the importance of working with independent data, especially in biomedical applications of machine learning.

Let's start with the core concepts. Whenever you do machine learning, you need a dataset of so-called examples, which are what you try to get the computer to learn from. These could be protein sequences, sentences from papers, or any data that you can represent as a high-dimensional vector, for example expression data. The examples can be labeled, meaning that you've assigned categories to them, in which case you want to do supervised learning: you try to get the computer to learn to classify examples, that is, to assign the right labels based on the input data. This could be as simple as doing logistic regression, or it could mean training something like a random forest. If you have unlabeled examples, you're in the territory of unsupervised learning; that is, you're trying to get the computer to learn the structure of the data. This could be as simple as running a clustering algorithm, but it could also mean training, for example, a variational autoencoder on the data.

Whenever you're doing machine learning, and especially supervised learning, it's important to consider how to split your dataset. You need a training set, which you use to fit the models. You will also generally want a validation set, which you use to evaluate the various alternative models you've been training and to select the best one.
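As a minimal sketch of this three-way split, here is what a random partition into training, validation, and test sets might look like in Python. The function name and the 70/15/15 fractions are illustrative choices, not something from the presentation:

```python
import random

def train_val_test_split(examples, val_frac=0.15, test_frac=0.15, seed=42):
    """Randomly partition examples into training, validation, and test sets.

    Note: a plain random split assumes the examples are independent;
    as discussed later, that assumption often fails in biomedicine.
    """
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

The seed is fixed only so the split is reproducible; in practice any disjoint partition of the examples serves the same purpose.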
As an alternative to a single validation set, you can use an approach called cross-validation. That is, you split your data into, for example, five partitions and, in a round-robin fashion, use each partition to validate models trained on the remaining partitions; that way, all the data gets used for both training and validation. Regardless of whether you have a single validation set or do cross-validation, you need an independent test set. This set is used to check the best model that you selected based on the validation set, to ensure that it is not overfit, and to obtain the final performance estimates that tell you how well the model will generalize to new examples.

When you make these different datasets, training, validation, and test, it is typically done by a random split or partitioning of the examples. The problem is that this approach assumes independence between the examples. Unfortunately, interdependent data are common in biomedicine, and that is a big problem when you want to apply machine learning in this domain: it leads to overestimated performance and thus to publication of bad methods that quite simply do not work anywhere near as well as claimed in the papers.

To understand this, let's think about some examples. In sequence analysis, you might want to train a machine learning method to predict protein function. In this case, you need to worry about common ancestry in your dataset; for example, it would not be okay to train on a human protein and then test the performance on the mouse ortholog of the same protein. Similarly, in text mining, you may want to do sentence classification, in which case you should worry about a common source of the sentences; that is, it would not be a good idea to train on one sentence and test on the very next sentence from the same paper. If you're working on protein interaction networks, like I do, you need to worry about common nodes in your network.
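The five-fold round-robin scheme described above can be sketched as follows. This is a minimal illustration, again assuming independent examples so that shuffling them is safe:

```python
import random

def cross_validation_splits(examples, k=5, seed=0):
    """Partition the examples into k folds; each fold serves once as the
    validation set while the remaining k-1 folds form the training set."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, validation

for training, validation in cross_validation_splits(list(range(10)), k=5):
    print(len(training), len(validation))  # 8 2, printed five times
```

Because every example lands in exactly one fold, each example is used for validation exactly once across the five rounds.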
That is, when you have a complex of the proteins A, B, and C, it will be represented as a network of the binary interactions AB, AC, and BC. In this case, you cannot train on the interactions AB and AC and then test the model on BC, since the three binary interactions are all related, typically both in terms of the experiment they come from and because they represent the same complex.

Fortunately, there are solutions to these problems. In sequence analysis, you typically want to do redundancy reduction: whenever two sequences in your dataset are homologs, you remove one of them, and you repeat this until you end up with a dataset with no homology between the remaining examples. That way, you can safely do a random split of the examples. Alternatively, you can do smart splitting. In the text mining example, you could split based on documents rather than sentences, thereby ensuring that all sentences from a given document are used either for training or for testing, never a mix. Similarly, in network analysis, you can do node-based splitting of the data: instead of randomly splitting the interactions, or edges, into training and test sets, you split the nodes, the proteins, and only consider interactions between proteins in the same set for training or for testing.

That's all I have to say about machine learning today. If you want to learn more about the perils of analyzing networks, I suggest you watch this presentation next. Thanks for your attention.
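The redundancy reduction step described above can be sketched as a greedy filter. Here `is_homolog` is a stand-in for a real homology check (in practice you would use an alignment-based similarity threshold); the toy prefix check below is only there to make the sketch runnable:

```python
def redundancy_reduce(sequences, is_homolog):
    """Greedily keep a sequence only if it is not a homolog of any
    sequence kept so far, so no two kept sequences are homologous."""
    kept = []
    for seq in sequences:
        if not any(is_homolog(seq, other) for other in kept):
            kept.append(seq)
    return kept

# Toy homology check: call two sequences homologs if they share their
# first four residues (a real check would use sequence alignments).
def toy_homolog(a, b):
    return a[:4] == b[:4]

seqs = ["MKTAYIAK", "MKTAYLLK", "GGHQSLVA", "MKTA"]
print(redundancy_reduce(seqs, toy_homolog))  # ['MKTAYIAK', 'GGHQSLVA']
```

After this filtering, no two remaining examples are homologs, so a plain random split of them is safe again.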
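Likewise, the node-based splitting of interaction networks described above might be sketched like this. It is a minimal illustration, assuming edges are given as pairs of node names; note that edges spanning the two node sets have to be discarded, which is the price of avoiding leakage:

```python
import random

def node_based_split(edges, test_frac=0.3, seed=1):
    """Split edges by first splitting the nodes: an edge goes to the
    training set only if both endpoints are training nodes, and to the
    test set only if both are test nodes. Edges spanning the two node
    sets are dropped to avoid leakage between training and test."""
    nodes = sorted({n for edge in edges for n in edge})
    rng = random.Random(seed)
    rng.shuffle(nodes)
    n_test = max(1, int(len(nodes) * test_frac))
    test_nodes = set(nodes[:n_test])
    train_nodes = set(nodes[n_test:])
    train_edges = [e for e in edges
                   if e[0] in train_nodes and e[1] in train_nodes]
    test_edges = [e for e in edges
                  if e[0] in test_nodes and e[1] in test_nodes]
    return train_edges, test_edges

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]
train_edges, test_edges = node_based_split(edges)
# No protein ever appears in both train_edges and test_edges.
```

With this split, the complex ABC example from above can no longer leak: its three interactions all involve the same nodes, so they necessarily end up on the same side of the split (or are dropped).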