and welcome to this Machine Learning I course. My name is Dmitri Korbach. I am a researcher here at the University of Tübingen, working on machine learning and its applications to neuroscience. This is actually the first time I'm recording a video lecture, so I think it will be a learning experience not only for you but also for me. I want to start by saying a few words about this course. As I said, this is an introduction to machine learning. It's supposed to be very basic; do not expect that after finishing this course you will be an expert in machine learning. We will be making baby steps towards machine learning. Many of you are in the master's program here in Tübingen, and in the following semesters you will choose to take other machine learning courses. So the purpose of this course is largely to prepare you to take any machine learning course that we offer and be able to follow it, or to open a textbook in the middle, if you want to look something up, and understand what's going on there. The second point is that we will have around 10 lectures covering a broad range of topics. We'll hop around the machine learning textbook a bit, trying to sample topics and give you an idea of what machine learning looks like and what it really is. This is going to be a mathematical course. In pre-corona times, when I taught this material, I used a whiteboard, and when I was preparing the slides for this course, I tried to keep this whiteboard feeling. Let's see how it works. The focus is on the math. We will have some practical assignments in Python throughout the course, three of them, but I'm not going to talk in the lectures about implementation or Python details; we'll have separate time for that. All the necessary mathematics we will introduce as we go.
Experience tells me that some of you will not be familiar with some of the math that we need. I will not have separate lectures to introduce this mathematics; we'll just jump right in, and whenever we need something, I'll try to explain how it works. If you still feel that you don't follow, that it's too much, or you just prefer to read a more systematic mathematical exposition, then one book I can recommend is Mathematics for Machine Learning. It is freely available online; just Google it, you will find the website. It covers everything that we might need and more. And finally, my aim in this course is to introduce key concepts and develop key intuitions behind these concepts in machine learning. The second thing that I want to say in the introduction is: what is machine learning? What is this course about? Let's talk about that briefly. If you open Wikipedia and read the definition, the first sentence says: machine learning is the study of computer algorithms that improve automatically through experience. Sounds good. What I think this sentence brings to mind is something like a computer program playing chess or playing Go, or a self-driving car, right? It somehow experiences different chess games, for example, learns and improves its play, and then can beat human players. One way to think about this is to contrast it with more traditional approaches to solving this kind of problem. Traditionally, and by traditionally in this case I mean 30 years ago, if you wanted to program a chess engine, or if you wanted to program a computer to, let's say, look at an image and say whether it's a cat or a dog, then in computer vision 30 years ago you would try to come up with rules that would be able to solve this problem. So maybe a cat has whiskers and ears like that.
So you would try to build a pattern detector that looks for these ears, and if they're in the image, then it's a cat. You put in some data and you put in some rules that you as a programmer came up with, and then the software gives you the answer. In machine learning, the same paradigm is a bit upside down. You put in the data and you put in the goal, or the answers if you want. You give the computer an image and say that's a cat, and here's a dog, and then you repeat this a million times, and hopefully afterwards your algorithm will be able to discriminate a cat from a dog without you ever explicitly designing a rule that can do that. In some cases, maybe you don't even know how to distinguish a cat from a dog, or you can't formulate it easily in words, and the computer will just learn to do it. Okay, here is another definition, from a textbook that's called Machine Learning: the goal of machine learning is to develop methods that can automatically detect patterns in data, and then to use the uncovered patterns to predict future data or other outcomes of interest. That sounds a bit more general, and then Murphy, the author, continues: machine learning is thus closely related to the field of statistics, but differs slightly in terms of emphasis. That's important, and I also want to discuss it briefly. Of course, if you think about what statistics does, well, it detects patterns in the data and then tries to predict some outcomes of interest; the same definition, or almost the same, would apply to statistics. So what is this different emphasis between machine learning and statistics? I should say that this is, of course, a topic where you can find a lot of discussions, with some people saying statistics and machine learning are the same, or that they intersect and the part of statistics that is not machine learning is useless, or that the part of machine learning that is not statistics is just engineering, and so on.
So that's a very hotly debated topic. Let me offer my perspective here. Both statistics and machine learning aim to detect patterns in data, as in that quote, by building predictive models, but the emphasis is slightly different. Statistics uses this as a tool to learn something about the world. In statistics, it's called statistical inference: we want to infer some properties of the world that reveal themselves through the noisy observations, through the data. The focus, therefore, is on simple, interpretable models, such that after you fit the model you can actually look at it and learn something, right? And since the models are relatively simple, one can develop a lot of theoretical analysis: work out how the estimation procedures behave under some assumptions, what the statistical guarantees are, and so on. If you're taking any statistics course in parallel with this one, and I know some of you are, then you will see how this applies to what you're doing in that course. Machine learning uses the same approach, building predictive models to detect patterns in the data, but uses this as a tool to actually make useful predictions in challenging situations, and not so much to learn anything about the world. Therefore it usually focuses on complicated models, deep neural networks or something like that, and uses very large datasets; as long as your model works in practice, great, you achieved your goal. You usually do not try to learn anything about cats and dogs when training a neural network to classify them; you just want your system to be actually useful when somebody does a Google image search. So to a large extent we're giving up here on statistical inference and focusing more on prediction. There's a nice paper that I reference here, from 20 years ago: "Statistical Modeling: The Two Cultures" by Leo Breiman. It's a very influential, more philosophical paper about these two different approaches.
I think that even though 20 years have passed and machine learning changed entirely over these 20 years, this conceptual distinction still holds. This also means, at least in my opinion, that there is no hard boundary. It's more like a spectrum: you can arrange different predictive models on a spectrum where on one side there is clearly statistics and on the other side there is clearly machine learning. Just to give you an example: a clearly statistical tool would be a one-sample t-test, where you have a bunch of numbers and you want to test, let's say, whether their average is above or below zero using a hypothesis test. In a way, you are fitting a model. The model has one parameter, the average, and then you want to infer whether this average is above or below zero. On the other side of the spectrum is GPT-3, the recent gigantic neural network trained on all the text available on the Internet, which is famously able to generate very human-looking text. It's very hard to say how exactly it works on a mechanistic level. So at this point we're definitely not trying to learn anything about the text; we're just trying to generate human-looking text. That's clearly machine learning. And everything in between, I would say, or a lot of things in between, belong to both fields. Here is a cartoon that describes what the machine learning extreme looks like. You just have an algorithm that's completely non-transparent: you put in some data, you get some answers, but you are not doing any inference. It's not that it's literally a black box, you know how the algorithm works, it's a neural network, for example; but it's so complicated that it doesn't give you any new knowledge about your data, whereas that knowledge is exactly what statistics is after. A lot of things, though, are in between.
And for example, one of the most well-known textbooks on machine learning is called The Elements of Statistical Learning, so that's a term that even tries to combine statistics and machine learning in one name. By the way, about the textbooks: The Elements of Statistical Learning is freely available online, and the same is true of the first book here, by Bishop; you can also find it online. Both are great textbooks. Murphy's book is also very good. If this is your first time learning about machine learning, you may find them a bit complicated, but as I said, hopefully by the end of the course you'll be able to read them comfortably. The final introductory part is about the types of machine learning problems: you will very often see people divide machine learning problems into three big groups. Let's introduce this terminology. The first group is supervised learning problems. In supervised learning, you have some data, for example photos of cats and dogs, the same example, and they are labeled: there's a photo and it says it's a cat, and there's another photo and it says it's a dog, and so on. The task is to distinguish one class from another; maybe you have 100 different classes. It's still supervised, because it's supervised by the labels. Unsupervised learning: imagine the same data, but you don't have labels. You just have a bunch of images, and that's all; you don't tell the algorithm where the cats and dogs are. An example task here would be that you want the program to figure out that there are two (or more) different kinds of animals, and essentially cluster your images by what animal it is. That's the unsupervised learning setting. And reinforcement learning is the third group, where the program is more like an agent: it has to make decisions in some sort of game environment or exploration environment.
For example, play chess, play Go, drive a car: these are all reinforcement learning tasks. In this course we are not going to talk about reinforcement learning, which has a somewhat different set of ideas behind it; you'd need to take a separate course to learn about that. We will be talking about the first two, mostly about supervised learning and less so about unsupervised learning, which is the same in any machine learning textbook you will pick up. And that is because supervised learning is easier, as we will see throughout this course. I want to finish with a famous quote from Yann LeCun, a deep learning researcher, from a few years ago. Let me show this cake picture here. He said that most of human and animal learning is unsupervised learning, and then compared it with a cake: the bulk of the cake is unsupervised learning, supervised learning is only the icing on the cake, and reinforcement learning is the cherry on the cake, referring to how actual animal intelligence works. That's contentious; some people might disagree. But the idea is that if you are a baby and you learn something about the world, most input that you get doesn't have explicit labels, and you learn a lot from this input. You just look at the world and you somehow learn to categorize it. At least that's the idea that people have about how a baby learns, and that's largely unsupervised. But that's a very difficult problem, and until very recently there was much more progress in supervised learning than in unsupervised learning. So supervised learning will also be our focus here in this introductory course. Okay. With this, we can jump in, and we will start with linear regression. That might sound a bit underwhelming to some of you: we're talking about machine learning, and this is a technique that is 100 years old, or probably more. A very classical statistical method, linear regression.
However, I think it's actually very useful for demonstrating a lot of concepts in machine learning. We will in fact spend about four lectures on linear regression, because I will use it as a vehicle to introduce a lot of useful concepts that then appear everywhere in machine learning. And today we're really talking about a very, very simple situation. "Simple linear regression" is actually a technical term; it does not mean a linear regression that is simple, it means that you only have one predictor. Let's choose an example: you want to predict the height of a person from, let's say, the age of the person. So x in this case will be the age and y will be the height. If you collect the data, you will see that they're positively correlated: the larger the age, the taller the person. Let's say that's how the data look. And we want to fit a linear function here. A linear function has two parameters. One parameter is called the intercept; that's beta_0 here, where the line crosses the y-axis. The second parameter is the slope. That's just some terminology. This is a supervised learning problem, of course, because we have the y. It is not a classification problem, since we don't want to separate two groups of objects from one another; it's called a regression problem because we're predicting a continuous variable y, the height. We have some training data, which is a set of pairs (x_i, y_i), where i goes from 1 to n, and n is the sample size. And we want to fit this model, right? We want to fit the linear function f(x) = beta_0 + beta_1 * x to the data. This means we want to choose beta_0 and beta_1: we want to find the values of these coefficients that yield the line that describes our data well. As I said, our prediction, the value f(x_i), should be close to y_i. So how would we do it? What is the way to mathematically formulate this problem?
As opposed to just saying it should be a good fit, how can we formalize the notion of a good fit? A central object here in machine learning is the loss function. We want to write down some function that describes how well our model, with given coefficients, describes the data, and then we want to tinker with the model until it fits well enough, or until it's the best possible fit. The loss function is also sometimes called the cost function. Let me just give you the loss function of linear regression directly. Let's look at it: it's the mean of squared errors. We take the sum over all training examples from 1 to n of the squared deviation between the actual value y_i and our prediction f(x_i), and then divide by n, so it's the average of these squared deviations. Our prediction is of course just beta_0 + beta_1 * x_i, and that's what we're getting over here. This is called the mean squared error. A good question here is: why are we using the mean squared error as the loss function for linear regression? I'm actually not going to answer this now, but I suggest that you spend a few moments thinking about what possible loss functions one could have chosen instead. It's not obvious that this is the best, or in some sense optimal, loss function. I could have used the third power and not the second power. I could have used the absolute value, the first power; of course I want to penalize the positive errors and also the negative errors, so I can just take the absolute value and sum that up. Or maybe the fourth power, or maybe some other function of this difference, or maybe it should not be a sum but a product. You can come up with a lot of functions that will be small if the fit is good and large when the fit is bad, which is what we want. So why are we using this particular function? There can be several answers.
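The mean squared error just described fits in a few lines of Python, the language of the course assignments. This is only a sketch: the age/height numbers below are made up for illustration, and the particular candidate lines are arbitrary.

```python
def mse_loss(beta0, beta1, xs, ys):
    """Mean squared error of the line y = beta0 + beta1 * x on data (xs, ys)."""
    n = len(xs)
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys)) / n

# Toy age/height data (hypothetical numbers, just for illustration).
ages = [2.0, 4.0, 6.0, 8.0]
heights = [90.0, 105.0, 118.0, 130.0]

# A line that fits badly has a much larger loss than one that fits well.
print(mse_loss(0.0, 1.0, ages, heights))    # poor fit: large loss
print(mse_loss(80.0, 6.5, ages, heights))   # decent fit: small loss
```

A perfect fit gives a loss of exactly zero, which is the smallest possible value of an average of squares.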
One answer is that it is just computationally very easy to work with; you will see some other answers in later lectures. For now we'll just try this loss function out and see how it goes. Another term that you might hear when you read on this topic is ordinary least squares: we want to find the f that minimizes the squared error, hence "least squares", and this is called the ordinary least squares regression or estimation problem. Okay. Now we will take another step: instead of simple linear regression we will consider what I will call baby linear regression, and that is not a standard term. We forget about the intercept and try to fit an even simpler model, a ridiculously simple model which has just one parameter, beta. The loss function is the same but now also has just one parameter, beta. Of course, if I plot the model as a function of x, we are still fitting a straight line through this point cloud, but the line has to go through the point (0, 0), because the intercept is constrained to be zero. I'm not aware of any simpler setting than this; we should be able to understand everything about it before we start talking about more complicated problems. So let's think, and this is really important, about the loss function. The first question: the loss function is a function of what? The loss function is a function of the parameter, of beta. When you change the parameter, the loss changes. In this case, if you look at the loss that I've written up there and imagine opening the brackets, the squares, and summing it all up, you will get a polynomial in beta, which is a quadratic polynomial. You will have a term with beta squared, then a term with beta in front of it, and a leftover term with no beta. The coefficients will look complicated, because they will be sums over all samples of x_i and y_i and so on, but it's still just a quadratic function, the kind you study in school.
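To make the quadratic shape explicit, here is the expansion for the baby model f(x) = beta * x, written out from the loss just discussed:

```latex
L(\beta) \;=\; \frac{1}{n}\sum_{i=1}^{n}\bigl(y_i - \beta x_i\bigr)^2
\;=\; \beta^2 \underbrace{\frac{1}{n}\sum_{i=1}^{n} x_i^2}_{a}
\;-\; 2\beta \underbrace{\frac{1}{n}\sum_{i=1}^{n} x_i y_i}_{b}
\;+\; \underbrace{\frac{1}{n}\sum_{i=1}^{n} y_i^2}_{c}
```

So L(beta) = a * beta^2 - 2b * beta + c, with a > 0 (it is an average of squares): an upward-opening parabola with a single minimum.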
So if we draw a cartoon of that, with beta on the x-axis and the loss on the y-axis, then it's a parabola, which has a minimum, and that minimum is what we want to find. In statistics, people write beta hat to denote the estimate: given the training data, you want to estimate beta. There is some true beta, perhaps, that we don't know, but given the training data we can estimate it, and given this loss function our estimate beta hat should be just the location of the minimum of this loss. The question is how to find it. You have your x's, you have your y's; how do you find beta? That's what we want to talk about. There are several ways to approach this, and I will introduce a way that maybe looks funny for this problem, because if you have a parabola, you can find where its minimum is with school math. But let's approach it with the understanding that we will later have more complicated problems where finding the minimum is not so easy. The standard tool, actually the simplest tool, in this case is gradient descent. When we apply gradient descent to our baby linear regression, we get baby gradient descent. Let's talk about that. Here is the same parabola as before, and imagine that you have an initial guess for what beta could be, and it's not a very good guess, it's just somewhere, maybe random. You start with some value of beta over here, and then what you're going to do is look at the derivative of your function as a guide for the direction in which you should change your beta. If the derivative here is positive, as it is, then you should go to the left, you should decrease your beta, whereas if you start over there and the derivative is negative, then you should go to the right. You make a step, and then you reconsider: what is the derivative here? Still positive, so you go more to the left; still negative, so you go more to the right. If you end up over
here at some point, then the derivative is positive, so you go to the left. If by chance you end up directly at the minimum, then the derivative there is exactly zero, which means you're done: you found your minimum. Let's formalize that (I will talk about the second picture in a moment) and write down the update rule. I start with some beta, and then I look at the derivative, the derivative of the loss with respect to beta. If it is positive, then I need to decrease my beta; that's why there's a minus over here. And the eta is just some number that we choose in advance, called the learning rate. That's it, that's baby gradient descent. It's super easy: you can program it in basically three lines of code, a simple for loop, and it will eventually give you the minimum of your function. The conceptual question is: will this always work? Is it always a good idea to do it like that? My second image tells you that perhaps not. You can have a function that looks like the one on the right, and if you start here, then gradient descent will bring you down here, which is good, that's the global minimum. But if you start in a bad spot, then you will descend over here into this local minimum, and you will never get out, because gradient descent will never want to go up, so you will be stuck in the bad local minimum. A function like that is called non-convex, and it's tricky to optimize a non-convex function, at least to find its global minimum, using gradient descent, and in fact using any other method too. Luckily, in baby linear regression, and also in simple and in any linear regression, the loss function is convex, so it never has this problem: it's a quadratic function, at least so far, so it has just one local minimum, which is also the global minimum. So great, we can use gradient descent to solve this problem very effectively. Now, the learning rate: I said it
has to be chosen in advance. What does it do? I mean, it's clear what it does: it governs how big the steps are. But what should you keep in mind when you set it? Here's what can happen if your learning rate is too large. You start at the initial point down here, and then you see that you need to go left, and you go left, but the step that you're making is too large, and what can happen is that you jump to the other side. And then here, now the derivative is negative, so you need to go to the right, which is correct, but if you make the step too large again, you end up here, and then here, and then there. This never stops, and you just diverge away from the minimum instead of converging towards it. So if your learning rate is too large, you can have a divergence problem. On the other hand, you might think: well, I'll just choose a very, very small learning rate, which is often a good idea; however, if it's too small, then you will converge very slowly. You will make tiny steps, and it will take you a long, long time to reach the minimum, which is also not optimal. In fact, it can be tricky to choose a good value of the learning rate for any given problem. For some complicated problems later on, for neural networks, people just experiment with which value of the learning rate works better, and if you see that your loss function starts to grow, then you know you should probably decrease your learning rate. Okay, let's say you chose a learning rate that works. Your loss function decreases as you run this for loop: you decrease your beta a little bit based on the derivative, then you look at the derivative again, you decrease it a bit more, you look at the derivative, you decrease it a bit more, and so on. So you have iterations on the x-axis and the loss on the y-axis, and it just goes down. If it goes down, it means everything works well, it's not diverging. The question is, though,
where do you stop? With this kind of method, gradient descent, you will never arrive exactly at the minimum; it is very unlikely that by chance you hit a point where the derivative is exactly zero. The loss will just keep decreasing, and you need to stop at some point, so you need a stopping criterion, which can be different things. You can say: if the loss decreases from one iteration to the next by just a tiny amount, maybe one percent, then I'm stopping; that can be a stopping criterion. You can say: if the derivative is smaller than some limit, then we are close to the minimum, to the flat point of the loss curve, and we can stop. Or in some cases you just say: whatever, I'm doing a thousand iterations and then I'm done, which is also a stopping criterion. But if you program this, you need to set up some rule for where your for loop, or while loop, will stop. Okay, so now, if we want to apply this to our baby linear regression problem, we need to actually compute the derivative. I kept writing "derivative, derivative", but what is this derivative? Let's do it. You have the loss function, here it is again, and we need to compute its derivative. It's super easy, of course: you just differentiate the square, which gives you a two that goes in front, then you get the same bracket, and then, within the bracket, remember that it's a derivative with respect to beta, not x, so you get this minus x_i sitting over here. Then we can rewrite this a little bit, like that, and that's it. If you have all your x_i values and y_i values, which, remember, are just the training data, the data that you have, and the current value of your beta, then you can compute that sum. You get a number; that number is the gradient. Then you apply the update rule and you update your beta, and that's all you need to program gradient
descent in this simple situation. Of course, as I said before, in this simple situation we can work out the analytical solution; we don't actually need gradient descent. If you want to work out the analytical solution, you in fact still need the derivative, and then you ask: where is the value of beta such that this derivative is zero? Because that is where the minimum is. So we just say that at the minimum this derivative is zero, and I will already put a hat on the beta in this equation, because the beta for which it is zero is exactly the estimate we're after, so this beta can have a hat. Okay, and that's just a very simple equation that one can solve, and this is the solution to our baby linear regression. So if you actually want to solve it, you don't need gradient descent; but if you do run gradient descent, you will, hopefully, converge to this beta hat. Now, if this is all clear, then we can go back to our slightly more advanced situation of simple linear regression, where we had two coefficients: not only the slope but also the intercept, the beta_0. Let's see how this identical machinery that we developed generalizes to this situation. Now, if you have some data, imagine that you vary your lines: not only the slope varies, but also the intercept, so you can move the lines up and down, and you can change the slope. You have two parameters describing these lines, so you have a two-dimensional space over which you are optimizing. If you want to draw the loss function now, it becomes a 3D drawing: you have beta_0 and beta_1, and for every combination of beta_0 and beta_1 you have a value of your loss. So it's a loss surface; it's not a loss curve anymore, as we had before, but a loss surface, and we want the minimum of it. How does one find the minimum of a surface like that in three dimensions? And the answer to that is
that one still uses derivatives, but one needs to use partial derivatives. Let me briefly explain what that is, for those of you who have never encountered partial derivatives before. We think about it in the same way as before: we have the surface, a cup in 3D, you start with some guess, and then you want to go down towards the minimum, but you have two things to change: you can change beta_0 and you can change beta_1. What we do is say: well, let's just handle one at a time. It's very simple, really. We fix beta_1 for a second; this corresponds to taking a cut through this 3D figure. You just cut it along the line of the fixed beta_1, and you get a plane where only beta_0 can change, and that's essentially the situation we had before: beta_0 can change, for each beta_0 you now have a value of your loss, and you want to go down. So you can compute the normal derivative in this cut plane, and that, by definition, is the partial derivative of our loss with respect to beta_0. Then you can do the same thing, I don't have an image for that, but you cut at 90 degrees, along the beta_1 axis, for fixed beta_0, and again you obtain a one-dimensional function where you can compute the normal derivative, and that's the partial derivative of this two-dimensional function with respect to beta_1. So we compute both, and then we update beta_0 and beta_1 with the same gradient descent update rule, using these two partial derivatives. We'll see how it works. Ah, right: I keep saying gradient descent, gradient descent, but where's the gradient? So far there was no gradient. We can think about these two equations, as written down on the slide, as the two components of one vector equation. Let's think about beta now not as two separate parameters beta_0 and beta_1, but as a vector: it's a vector in two dimensions, with two coordinates, beta zero and beta
one. I want to write this update rule in a way that treats beta as a vector. It's very simple, of course: I can take these two partial derivatives in two dimensions, and that, by definition, is the gradient. The gradient is nothing more than the vector consisting of the partial derivatives along each coordinate. If our function has only two variables, beta_0 and beta_1, then the gradient is a two-dimensional vector. So we can very conveniently write down this vector form, which is now an equation that in principle applies in any dimensionality, but in our case here it lives in two dimensions. So what would it look like if you start iterating this gradient descent update rule? Here's another way to visualize it: the same loss function that I previously drew, but now looked at from the top. If I look from the top, then I'm making a plot that has beta_0 and beta_1 as coordinates, and I don't have the loss coordinate anymore, but I'm plotting it like an actual map of terrain with mountains, where you have these concentric curves that tell you the height at each point. That's how you should think about this plot: it tells you where the value of the loss L is larger or smaller. It's convenient to think about it like that, because one can easily visualize what's going on during the gradient descent iterations. You start somewhere, and for a few points here I drew these vectors; the vectors show the values and directions of the gradient. If you start over here, the gradient is in fact perpendicular, I didn't prove that, but it's perpendicular to this curve of constant height, which I think is pretty intuitive if you think about it. Then you make a little step in the direction opposite to the gradient, you look at the gradient again, and one can see
that, hopefully, if you do not run into the divergence problem, you will actually converge towards the minimum. If you make too large steps, you can start hopping from one side of this valley to the other, and if your learning rate is too large, you might diverge. So exactly the same things that we said before apply here in the two-dimensional situation. Well, in order to apply this, you need to compute the partial derivatives, as we did before in the one-dimensional situation. This is our loss with beta_0 and beta_1, so let's compute the two partial derivatives. That's in fact as simple as before, just slightly more confusing, because you need to remember where the 0 and the 1 go. When you compute the partial derivative with respect to beta_0, the beta_1 * x_i term inside the bracket just acts as a constant, and you are left with that; the derivative with respect to beta_1 is slightly more complicated, because beta_1 has x_i in front of it, so x_i comes outside of the bracket, and that's what you have. And that, again, is all you need in order to program gradient descent here: you can think of the first line as the first coordinate of a vector and the second line as the second coordinate, and you update your beta vector by this vector on each step. As an exercise, one can think about how to derive the analytical solution for beta_0 hat and beta_1 hat. How do you do it? You say: well, this equals zero and that should also equal zero, so you have a system of two equations, just a good old linear equation system, that one can pretty easily solve by opening the brackets, and then you obtain formulas for beta_0 hat and beta_1 hat that attain the minimum of your loss. And this concludes the first lecture. Thank you.
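To make the whole lecture concrete, here is a minimal Python sketch of both approaches discussed: gradient descent on the two-parameter loss, and the closed-form solution obtained by setting both partial derivatives to zero. The function names, the fixed-iteration stopping criterion, and the toy data are my own choices for illustration, not from the slides.

```python
def fit_by_gradient_descent(xs, ys, eta=0.01, n_steps=20000):
    """Minimize the MSE loss of y = beta0 + beta1 * x by repeatedly updating
    the parameter vector (beta0, beta1) with both partial derivatives."""
    n = len(xs)
    b0, b1 = 0.0, 0.0                       # initial guess
    for _ in range(n_steps):                # fixed iteration count as stopping criterion
        residuals = [y - b0 - b1 * x for x, y in zip(xs, ys)]
        grad0 = -2.0 / n * sum(residuals)                              # dL/dbeta0
        grad1 = -2.0 / n * sum(x * r for x, r in zip(xs, residuals))   # dL/dbeta1
        b0, b1 = b0 - eta * grad0, b1 - eta * grad1                    # update rule
    return b0, b1

def fit_by_closed_form(xs, ys):
    """Solve the linear system obtained by setting both partial derivatives to zero."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    b1 = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) \
        / sum((x - x_mean) ** 2 for x in xs)
    b0 = y_mean - b1 * x_mean
    return b0, b1

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.2, 4.9, 7.1, 9.0, 10.8]             # roughly y = 1.2 + 1.9 * x plus noise
print(fit_by_gradient_descent(xs, ys))
print(fit_by_closed_form(xs, ys))           # the two should agree closely
```

In this simple setting the closed-form solution is obviously preferable; gradient descent is here only because it generalizes to models where no such formula exists. The baby regression from earlier is the special case where b0 is pinned to zero and only grad1 is used.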