Welcome to this deep dive into supervised learning. Supervised learning is a machine learning task in which you start from labeled data and use it to train a computer to predict some output of interest from some input you have available. In this presentation I'll cover the different types of supervised learning problems, how you encode the input data to make machine learning possible on it, what it means to train a model, and finally a little bit about the importance of data versus methods. If you're not already familiar with machine learning, I recommend that you watch my short introduction to the core concepts of machine learning before continuing with this presentation.

There are two main types of supervised learning problems. The first is regression problems, in which we're trying to predict values; that is, we're fitting a model that predicts one or more continuous-valued outputs from some input values. A good example is trying to predict binding constants for small-molecule compounds against some protein targets. The other, and most common, class of problems is classification problems. In this case we're trying to predict discrete labels instead, so we would, for example, want to separate the red from the blue dots by fitting some decision surface that separates them. Separating blue from red is an example of binary classification, in which you're trying to put each example into one of two groups. You can also have multi-class classification, in which you're trying to put each example into one of several groups. And finally you have multi-label classification, in which you assign zero or more labels to each example. A good example of the latter is protein function prediction, where you will often want to assign multiple different GO terms to the same protein.

When you're doing supervised learning, and machine learning in general, data encoding is important.
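To make the binary case concrete, here is a minimal sketch in plain Python of a decision rule that separates two groups of points. The nearest-centroid rule and the synthetic "red" and "blue" clusters are my own illustrative assumptions, not something from the talk:

```python
# Minimal nearest-centroid binary classifier on synthetic 2-D points.
# The "blue" and "red" clusters below are made up for illustration.

def centroid(points):
    """Mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def classify(point, centroids):
    """Assign the label of the nearest centroid (by squared distance)."""
    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    return min(centroids, key=lambda label: dist2(point, centroids[label]))

blue = [(0.0, 0.1), (0.2, -0.1), (-0.1, 0.0)]   # cluster near the origin
red = [(2.0, 2.1), (1.9, 2.0), (2.1, 1.8)]      # cluster near (2, 2)
centroids = {"blue": centroid(blue), "red": centroid(red)}

print(classify((0.1, 0.0), centroids))  # prints "blue"
print(classify((2.0, 2.0), centroids))  # prints "red"
```

The "decision surface" here is simply the line of points equidistant from the two centroids; real methods fit far more flexible surfaces, but the idea of assigning each example to one of two groups is the same.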
For machine learning, everything is vectors. The algorithms have never heard of letters, amino acids, or nucleotides. For this reason, whenever we're working on something like sequence data, we typically use what is called one-hot encoding. If we have a nucleotide sequence, we can encode the letter G by the vector one zero zero zero, A as zero one zero zero, T as zero zero one zero, and C as zero zero zero one. This means that for nucleotide sequences we need four dimensions per position, and for amino acid sequences we need 20 dimensions per position, since there are 20 different amino acids. The problem with this is that for longer sequences we very quickly get into very high-dimensional input spaces, which are more difficult to fit models in.

It's also important to be aware that better encoding really matters, because a better encoding gives you an easier problem. Effectively, by changing the encoding you're moving the points around, putting the blue points closer to each other and the red points closer to each other, thereby making them easier to separate. For this reason, there has over the years been a lot of focus on different types of feature engineering, for example making use of domain knowledge, such as which amino acids are more similar to each other, to construct better encodings. More recently, people have started using so-called representation learning instead, a deep learning technique in which you use machine learning to learn the input vectors themselves.

Model training is the task of making the actual model. There are many different methods for doing that, but it's important to be aware that they are all based on the same fundamental idea: you have a mathematical model with many different parameters, which, if you're working with neural networks, are typically called weights. You then optimize these parameters to minimize the prediction errors across the training examples. There are many different methods.
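The one-hot scheme described above can be sketched in a few lines of plain Python. The letter order G, A, T, C follows the talk; the function name is my own:

```python
# One-hot encoding of a nucleotide sequence, using the letter order
# from the talk: G -> 1000, A -> 0100, T -> 0010, C -> 0001.
ALPHABET = "GATC"

def one_hot(sequence):
    """Encode a DNA sequence as a flat list of 0/1 values,
    four dimensions per position."""
    encoding = []
    for letter in sequence:
        vec = [0] * len(ALPHABET)
        vec[ALPHABET.index(letter)] = 1
        encoding.extend(vec)
    return encoding

print(one_hot("GA"))  # [1, 0, 0, 0, 0, 1, 0, 0]
```

Note how the dimensionality grows: a sequence of length n yields a 4n-dimensional vector, which is exactly why long sequences push you into very high-dimensional input spaces.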
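The core idea of training, tuning parameters (weights) to minimize the prediction error over the training examples, can be sketched with gradient descent on a one-parameter linear model. Everything here, data included, is a toy illustration:

```python
# Gradient descent on y = w * x: adjust the weight w to minimize
# the mean squared prediction error over the training examples.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]    # roughly y = 2x, with a little noise

w = 0.0                      # the model's single parameter ("weight")
learning_rate = 0.01
for _ in range(1000):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 1))  # converges to roughly 2.0
```

Real methods differ in the form of the model, the error function, and the optimization algorithm, but this loop of "predict, measure error, adjust parameters" is the fundamental idea they all share.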
They have different models, they use different optimization methods, and they minimize different error functions. These methods have different strengths and weaknesses, and it's therefore important to be aware of them, so that you can make good choices for any given task.

However, it's also important to be aware of the importance of data versus methods. If you have a better dataset, you will of course get a better model. Similarly, if you have a better method, you can make a better model from the same data. But when it comes to training supervised learning algorithms, data is king: time spent cleaning data gives much bigger improvements than time spent tweaking methods. The reason is quite simple; it's the garbage-in, garbage-out principle. If you're starting from terrible data, you will get a terrible model no matter what you do.

That's all I have to say about supervised learning. If you want to learn more about machine learning in biomedical tasks, take a look at this presentation next. Thanks for your attention.