Recording? Yeah, it's recording. So I know the QLS students have already followed machine learning classes, but on different topics, so I will actually start by explaining the difference between what you did and what we are going to do. All right. I would like to start with a small introduction on the question: what is machine learning? For those who have no idea what it is, that's a good question to start from. Machine learning is a set of algorithmic tools, but also mathematical tools, used to learn from and make predictions about data. Of course, you are all aware that machine learning is directly connected to data, and actually to large amounts of data, what we call big data. And we want to use this data to make predictions. Really, the keyword here is prediction; I will insist strongly on that word many times, and we will see why. So machine learning is how to learn from and make predictions about data; the two keywords are predictions and data. And of course, we want to do that in an automatic way. That's the whole point: we want to design a method, a procedure, that is able to extract the interesting information from the data automatically. I guess most of you have already heard about machine learning and artificial intelligence many times, but I'm not sure it is very clear to all of you what the subfields of machine learning and artificial intelligence are, and first of all, what the difference between artificial intelligence and machine learning is. I will write AI for artificial intelligence. May I ask a question? Yes, of course. How is machine learning different from interpolation? This is exactly what I will discuss. Machine learning is about making predictions, and prediction is not just interpolating.
So we will answer exactly that question today, hopefully. Thank you. AI is a huge field that started in the 1950s, actually, and it covers essentially everything that is used to make predictions. AI is a bigger field than machine learning, but nowadays machine learning is really the part of AI that is moving forward the fastest. This is where most of the research is done, and it is a very exciting research field, so I advise you to learn a bit about it. That's why you're here, so that's great. Inside AI there is a subpart, which is machine learning. AI is a big thing, and there are things inside AI that are not machine learning. For example, what we call expert systems, which are inference systems using trees of logical decisions. It's just if-then-else, but very complicated structures of if-then-else, and these expert systems are really coded by hand. If you have a rich enough structure of if-then-else, you get a system that starts to make rather complex decisions. But this is not what we want to study. We want to study things that make decisions automatically: you don't want to code the if-then-else yourself; you want to use data in some way, and the procedure must find these if-then-else rules, if you want, in an automatic way. So machine learning is a subpart of AI, and there are essentially three huge boxes inside machine learning. The first is supervised learning, which is what this course is about. Supervised learning means learning from labeled examples (we will see what that means): procedures that are fed with many, many examples. What an example is depends on the problem you want to solve. For instance, say you want to distinguish automatically between images of dogs and cats.
What you do in supervised learning is first create a huge database of images that you label: for each image, you know whether there is a cat or a dog on it, and you give it to the procedure; that is, you design this big dataset. From these labeled examples, the labels here being dog and cat, you want the procedure to learn to distinguish the two types of images automatically. The key notion here is prediction, in the sense that what you want in the end is that your final procedure, on new images that were not seen during training, that is, not in the dataset used to train the algorithm, is able to predict the labels of these new images. This is what we will focus on. Then you have the field of unsupervised learning, which means learning from unlabeled, or raw, data. In unsupervised learning, you want an algorithm that finds patterns in raw data. Raw data could be images without any labels: these images are not given a label like dog, cat, house; they are just plain images, and you want the algorithm to automatically find patterns, meaning, for example, common points shared by some images in the dataset, so that you find structure in your data in an automatic way. I know that the QLS students have already studied such algorithms. Typical tasks in unsupervised learning are, for example, clustering, which means finding groups in raw data, and dimensionality reduction, where you want to find a more compressed representation of the data without losing too much information; the key procedure in that class is called PCA, for principal component analysis. In supervised learning, on the other hand, the types of tasks we will consider are essentially two. There is regression...
And classification. We will see what these mean, but essentially regression means making predictions about quantities that are continuous: real numbers or real-valued vectors. In classification, you want to make predictions about discrete variables. That's it. If you want, classification can be thought of as a special case of regression, but the strategies we will use for classification are different enough from what we do for regression that we give it another name. For example, the task of classifying images of dogs and cats is a classification task. A regression task could be the following: say you are a medical doctor and you have many features about your patients, their weight, height, age, many characteristics, and you want to use this data to infer some unknown quantity about these patients, for example the probability of developing a cancer. This probability is a real number between zero and one, and therefore you are doing regression. All right. Do I write clearly enough so that you can read? I hope so; otherwise I can make more of an effort, but I will be slower. There's a question in the chat? Don't worry, I see the chat. So regression just means doing inference, doing prediction, about real-valued variables, quantities that take continuous values. And the last box of machine learning is what we call reinforcement learning. The QLS students also had a course about reinforcement learning with Antonio. So what is reinforcement learning? Reinforcement learning is about learning through interacting with an environment. Essentially, it is a nice framework to model what life is; many other systems too, of course, but typical systems you want to study with it are living systems. This is Antonio's field of research.
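Coming back for a moment to the regression/classification distinction: it can be made concrete with a minimal Python sketch on synthetic, purely illustrative data. The same linear score is used in both cases; the difference is that regression outputs real numbers while classification outputs discrete labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: one feature x, a continuous target for regression
# and a binary target for classification (all values are made up).
x = rng.uniform(-1, 1, size=100)
y_reg = 0.3 + 0.5 * x + 0.05 * rng.normal(size=100)  # continuous label
y_cls = np.where(x > 0, 1, -1)                       # discrete label in {-1, +1}

# Regression: least-squares fit; predictions are real numbers.
A = np.stack([np.ones_like(x), x], axis=1)           # design matrix [1, x]
w, *_ = np.linalg.lstsq(A, y_reg, rcond=None)        # fitted intercept, slope
pred_reg = A @ w                                     # real-valued predictions

# Classification: a linear score again, but the prediction is a discrete label.
pred_cls = np.where(A @ [0.0, 1.0] > 0, 1, -1)       # threshold the score
```

Note the design choice: the model family is the same (a linear function of the features); only the type of the output, and later the way errors are measured, differs between the two tasks.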
So essentially you have an agent, meaning a system which interacts through some sensors with an environment. You get a signal from the environment that can lead to pain, to reward, or to nothing, and from this partial information, obtained through possibly noisy interactions with the environment, the agent needs to learn how to behave in a way that maximizes some reward. Reward just means the quantity you want to maximize: typically, over the long run, a living system is trying to maximize, let's say, its lifespan, or happiness, whatever. So you define some quantity called the reward, which is usually a real number, and the agent, through interacting with this stochastic environment, needs to find a strategy, what we call a policy, such that it is able to maximize the reward over the long run. This is also what is used in robotics: a robot is something interacting with a complex environment; the environment is random, stochastic, with many sources of noise, many uncertainties, and still the robot needs to reach some goal. To do so it needs to define, in an online manner, a policy, a strategy, a behavior, and this is what reinforcement learning is about: learning by interacting with the environment, shaping a behavior, and the behavior is what we also call a policy in the language of reinforcement learning. So now I would like to ask you: do you have an idea of why machine learning became so popular in the last, let's say, 10 years? There has really been a revolution going on. Why is machine learning so popular since around 2010? What has changed in the world, so that machine learning became so attractive to many scientists and the research field exploded? But first, let me be clear on something.
Machine learning I put here as a subpart of AI, but actually all of this is a subpart of statistics more generally. Machine learning is based on statistical tools; all of this is about statistics. But why do we now call it machine learning and not just statistics? And why did things change so quickly? What happened? Does someone have reasons? Computational power became better, and we have a lot of data; it's also cheaper than employing many people to do the work a computer can do. Yes. So one thing is technological advance, which is indeed an increase in computational power, actually an exponential increase, and in particular the development of what we call GPUs, graphics processing units, which are rather cheap processors able to do very simple computations in a very efficient and fast way. Most modern machine learning algorithms, such as deep learning and neural networks, can be written in a way that interacts very naturally with these GPUs, and so you can train complex learning algorithms on them very fast and efficiently. So we have the power to train these machines. But don't think that machine learning is magic and that it solves every problem you want; that's not true at all, and this is what the course is about: understanding that it is not magic, that it is not a black box able to solve any problem you want. For it to work, you need computational power; say we have that, but then you need clean and sufficient data. You cannot extract information from a dataset if the information is not in it. If the dataset is too scarce, too noisy, it is not enough for the task you want to solve: you can use the fanciest algorithm that exists, a 25-layer neural network if you want, you will get nothing.
So the quality of your algorithm will also depend on the quality of the data. Indeed, the second thing is that we are in the era of big data: we have a lot of data, and this has exploded since around 2000. Now we have data on everything you can think of. Besides computational power, we also have an exponential increase in memory, because it is nice to be able to process data, but you also need to be able to store it, and now we have ways to store huge amounts of data. But keep in mind that the theory of machine learning, though at this stage we should really call it statistics, let's say the theory of neural networks as a typical example of a machine learning algorithm, is not something new. If I draw the number of publications concerning neural networks as a function of time: when do you think the study of neural networks started? The 1970s? No, you're far off; it started in the 60s. So on the time axis I put the 60s, then decade by decade the 70s, 80s, 90s, around 2000, then 2012, 2015, and 2017, because things are accelerating, so the scale here is not linear. The number of publications in, let's say, neural networks or machine learning broadly is a curve that looks roughly like this: there was a peak around the 80s that died out, and around 2012 things exploded, exponentially. What happened is that around the 60s people studied what we call the perceptron, which is the simplest possible network, and which we will study in detail in this course. The perceptron is just one neuron; essentially, it is the simplest model of a single neuron processing data. And at that time, people didn't have huge computers.
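As an aside before continuing the history: the perceptron just mentioned is simple enough to sketch in a few lines. This is a minimal illustration of the classical perceptron learning rule on synthetic, linearly separable data, not any particular historical implementation.

```python
import numpy as np

# A single "neuron": predict sign(w . x) and nudge w after each mistake.
# The data here is synthetic and linearly separable by construction.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))            # 200 samples, 2 features each
w_true = np.array([1.0, -2.0])           # hidden separating direction
y = np.sign(X @ w_true)                  # labels in {-1, +1}

w = np.zeros(2)                          # start from the zero weight vector
for _ in range(20):                      # a few passes over the dataset
    for xi, yi in zip(X, y):
        if yi * (xi @ w) <= 0:           # mistake (or undecided): update
            w += yi * xi                 # the perceptron learning rule

accuracy = np.mean(np.sign(X @ w) == y)  # fraction of correctly labeled points
```

On separable data like this, the classical convergence result guarantees the rule stops making mistakes after finitely many updates.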
People were really encoding neural networks by plugging cables in a very complicated manner; maybe I can show you a picture. What we now code in one line filled a room at that time. But people were playing with simple neural networks in the 60s and were able to solve very simple binary classification tasks between two classes. Then around the 80s there is this bump in activity, because people started to develop theoretical models. In particular, theoreticians were playing with things like the Hopfield model. The Hopfield model was the first model coming from statistical physics that was a simple, idealized model of memory: a rather complex neural network, much more complex than the perceptron, that is able to store patterns and, following some dynamics, to recover these patterns. That was also the first connection with neuroscience. People studied this Hopfield model and related models to death, but then you see the gradient became negative: less and less research was carried out on these models after this bump, essentially because we had said everything we could at the theoretical level, and there was no real way to implement these machines. We didn't have the computational power to use them efficiently, to really test them; the computers were just not powerful enough. But then GPUs appeared around here, and people had the tools to code more complex networks to solve complex tasks. In 2012, that was essentially the birth of what we call deep neural networks, and for the first time people solved an extremely complicated task, which was the classification of a huge dataset of images
with hundreds of different labels, let's say, different classes of images, and thousands and thousands of images to classify. These machines were able to solve this task for the first time around 2012, and around 2015 we reached superhuman performance. It means that on the classification of complex images, for the first time, a neural network was able to perform better than a human: if you give the same dataset to random people in the population and ask them to classify these images according to all the classes, people make mistakes, because sometimes you can confuse a dog with some other animal, you can get it wrong, and these algorithms at that point made fewer errors than humans. Since then, of course, there are many, many tasks where these algorithms became more efficient than us. In 2016 came AlphaGo; you must have heard about it. This was the first reinforcement learning algorithm, also using neural networks, that was able to master an extremely complex game that we thought unreachable for decades to come, but that was a mistake. This is Go, a complex board game originating in China, a combinatorial problem with a very large, exponential number of possible plays. So this algorithm cannot simply test all possible configurations to find the best one, which is essentially what was done for chess; chess was handled by such simpler search algorithms long before that. Go is much more complex, and for the first time the algorithm was better than the best player, who was a Korean player. Since then the revolution has continued: self-driving cars and so on, you know about that. All right, so let me clarify a further point: the difference between machine learning and classical statistics.
Okay, so what is the difference between statistics and machine learning? The difference is the following. In statistics... May I ask a question? Yes. Why didn't you include semi-supervised learning as a fourth category, between supervised and unsupervised learning? You know, it's just an arbitrary choice; of course you can create further subcategories, as many as you want, and in the end you would create one category per algorithm. Semi-supervised learning, as you say, sits in between supervised and unsupervised learning; I could create many more categories. This is a rough distinction, but classically we divide machine learning in this way, even though of course there are many subparts. Yes, thank you. Okay, so in statistics, the type of questions you are interested in are questions related to estimation. What does estimation mean? It means there is some unknown quantity you are interested in, and what you do is gather data which contains information about this unknown quantity. For example, you have some process that generates data, and this process depends on some quantity that is unknown. You gather a lot of data and you want to estimate, to infer (this is also what we call inference) this unknown quantity. So: how to use data to estimate unknowns; this is also called statistical inference. And classically, this is what I meant by classical statistics, the quantity you want to estimate, the unknown (it could be a vector of unknowns), is low-dimensional.
Low-dimensional means that there are just a few unknowns you want to recover, and by few I mean compared to the size of the dataset. So typically the unknown quantities, let me call them X, form a vector that belongs to R^P; this is the unknown, or what we also like to call the signal, as in signal processing. Then you have data that depends on these unknowns; say this data has a higher dimension, it lives in R^M. The classical regime of statistics, of estimation, is the regime where M is much bigger than P, and this is what I meant by low-dimensional: P is small with respect to M. I have a lot of data to estimate a few quantities, and what really interests you here is to recover this unknown quantity; you want to find these unknowns. An example could be the following. Say you record many trajectories of falling objects in different setups, from different initial conditions, and you know the laws of motion, Newton's laws, but you don't know the gravitational constant. You record all these trajectories and you want to recover the unknown, the gravitational constant. Here there is just a scalar number that you want to estimate from this large amount of data, the trajectories. Please. Y of X is the feature map of X, right? Sorry? Is Y of X the feature map of X? No, no. Y(X) is just data that depends on the unknowns. X belongs to R^P. But the dimension of Y(X) is greater than the dimension of X? It's like we start with a signal, and when we compute the feature map we get something bigger... No, I didn't mention any feature map here; I'm not talking about feature maps.
I'm doing something much simpler. Here I'm not talking about machine learning; I'm talking about statistics. You have some unknowns, you generate data according to some process, the data-generating process, and in the end you get data that depends on these unknowns, that is correlated in some way with these unknowns. Your task, given this large amount of data, these M data points, is to recover the few unknowns of dimension P. Yes, I get it. Okay. Can someone give me another example? Is it like when you have parameters of a distribution? Yes, estimating the parameters of a distribution could be a problem of this form. Or, for example, you are a physicist in a lab doing experiments, say interferometry, and you want to measure the speed of light. You do a lot of interferometry, you gather a lot of data, and you want to estimate just a number. You know the laws, you know the data-generating process, in the sense that you assume how the data is connected to the speed of light; you measure this data and you estimate the speed of light from it. And really, your interest is in recovering precisely the value of the unknown. So this is estimation, inference; this is classical statistics. Instead, what we are interested in in this course is machine learning. So what's the difference? The difference is that you're not necessarily interested in estimating some unknown precisely. What you are interested in is using data to make predictions, and you see this is a very different concept. It's not that there is some unknown that you want to precisely recover.
It's just that you want to design some models, some algorithms, some procedures, that will exploit your data in such a way that you are able to predict some quantities, some interesting features, about new unseen data that was not used to train your model. You see, it's very different: it's not about estimating something, it's about prediction on new unseen data. Of course, machine learning uses a lot of methods from statistics, but still, the core question is different. Also, in machine learning we are usually concerned with very high-dimensional regimes. What does that mean? It means that you will have a lot of data to train your model, but now your model will also be parameterized. I will call the parameters of the model theta; these are quantities you will need to fit, to tune, in order to set up your model, which will then be used for prediction. Let's say that theta has dimension N, and N is also very large. We are no longer in a regime where we want to find something low-dimensional; we will also want to find something of rather high dimension. But you see, these parameters may represent things that perhaps have no physical meaning; they are just parameters that parameterize your learning system. They do not represent a physical quantity with a precise meaning that you want to recover, like the gravitational constant or the speed of light above. The task is to find a model able to predict correctly for new unseen data. All right. Is there any question before I move to slightly more formal definitions? This was the introduction. Ah, maybe I can show you something, actually.
I wanted to show you something; maybe it would give a nice picture of what's going on. Give me one second, I need to connect this computer; take it as a two-minute pause. ... I wanted to show you one or two nice slides to illustrate what we said already, but we cannot connect to the Internet. I'm working with two computers and my iPad, which is super inefficient, but this is the best I can do, and we have network problems, of course, just to add a layer of complexity. All right, I will show you this next time; it was just nice images illustrating the classification of dogs and cats, but I think you get the idea. So let me now be a bit more formal. Let's start with a part I will call the machine learning ingredients. By the way, I am very closely following the notes I sent you, this review, which is very well written. So if there are things I don't cover here, I advise you to read it, and also to check independently what I say, to complement it, and maybe to read it before each course, as you feel. So when we set up a machine learning problem, there are essentially three main parts, and this is common to all tasks in supervised machine learning; the procedure is always the same. So let me give them right away, and then we will discuss an actual application and code a bit to see what is going on. The first main ingredient of a machine learning setup is what I call the observables, or the measurable quantities, or simply the data (actually, part of the data; you will see what I mean by that). I will call them x: one observable is a vector in R^P, where P is the number of features.
So the features are the components of this vector, and the features are assumed a priori to be interesting for describing the system (I don't want to use the word model here), or for the task. Features are things you can access: data you can gather that you think may contain information connected to the task you want to solve. So let's take an example. If you are again a medical doctor and you want to predict the susceptibility of a certain person to a certain disease, what you would naturally do is access all the possible characteristics you can think of about this person: age, height, weight, origin, whatever. These are things that you think may connect in some way to the presence or not of this disease, and these quantities you gather are called features. But maybe there are certain quantities that you do not think are interesting at all for the task you want to solve. Say you want to design a model able to automatically predict whether someone will get cancer or not: you could use the fact that the person is a Leo or a Sagittarius, or the precise time at which this person was born, or, let's say, the number of friends this person has on Facebook. You could measure that, you could ask these questions, but it looks totally irrelevant. You would not gather these pieces of information, because they seem not connected in any way to the presence or not of a cancer, and I would tend to agree with you. So you would not call these features, and you would just not measure them. The features are what you gather and think may be connected in some way to what you want to predict. Now, if you want to solve the task of classifying images of dogs and cats, what are the features in this case? Can someone tell me? The distance between their eyes; whiskers, maybe, in the case of cats?
Yeah, but this is rather complicated to obtain. It means that you would need to take each image and then measure each of these things, which are rather vague concepts; this seems very complicated. And I see the answer in the chat, and the answer is correct: the features in this case are just the pixels, the pixel values. If you have images, you consider that the information is contained in the pixels, and you don't necessarily want to pre-process them in a complicated manner or pre-extract information in an arbitrary way from these pixels. What you want is to design a model that will do that automatically for you. So what you give it in this case are just the images themselves, and in this case the images are represented by very large vectors which contain the values of the different pixels. So the features in this case are just the pixel values. Now, earlier I gave you an example of an estimation problem: you have many trajectories of falling objects, and you want to estimate the gravitational constant from the knowledge of the laws of motion. Let's think about another, related application. Let's say now we're stupid: we don't know the laws of motion, but we are still able to measure all these complicated trajectories under different setups, and we want to solve a prediction task. What could a prediction task be in this case? It could be: given an initial condition, an initial direction and an initial velocity, can you predict the end point, the point on the floor where the object will fall? This is a prediction problem: you want to design a model that will automatically take this information as input and predict the end point. In this case, the features are the initial velocity vector and the initial direction, so this would be a six-dimensional feature space.
You have the vector for the direction and the velocity vector. Actually, you just need the velocity here; it's a three-dimensional vector. So we have the features. Now we have a second ingredient, which I will call the labels, or the outputs, denoted simply by Y, a real number. Again, if this label takes discrete values, we talk about classification; if it is a continuous real number, it's regression. The label is essentially the quantity you later want to predict, but in your training dataset you have access to these labels; this is why we call it supervised machine learning. So let me give examples again. In the classification of dogs and cats, we have these images, the features are just the pixel values, and the label is just a binary number, which is plus one if the image is a dog and minus one if it is a cat. For a huge dataset, some human expert has classified all these images, so you have the answer for all of them. But later on, on new data which is not labeled, the prediction task will be to automatically output the correct label, even though you don't have access to it. Is that clear? So the dataset, which I will call D, is a set of pairs (x_i, y_i), where the input x_i is a p-dimensional vector and the associated output, or label, y_i is a real or discrete number, and i goes from 1 to capital N. This capital N is the number of data points, and each data point is a pair of input and output. So x_i could be the image, and y_i, in the classification task, could be zero for dog and one for cat. Or, in the problem of predicting the end point of trajectories, the inputs would be the initial velocity vector, and for each trajectory you would measure the final end point, for many, many trajectories.
And the label in this case would be the value of the endpoint, the position of the endpoint of the trajectory. Okay. I don't understand why you put zero for dog and one for cat, because I thought you were basing it on the pixels, so on the proportion of the three main colors, right? I didn't understand. Okay, so you thought we were basing the classification on the pixel colors? Yes, you have the image of the dog and the cat, then you have the pixels of each image, and then you assign a number to each pixel based on the color, the three main colors, yeah. Let's say it's in grayscale, in black and white. And so what is the question? Okay, so now it's grayscale, dark and white, so it's clear: that is zero and one. Of course. Yeah, initially I thought it was in color. But if you have color, then instead of feeding the algorithm with just one vector, it would be three vectors, or maybe you can stack these three vectors into a single one of bigger size. You can always do that; there is no problem. But generically, I will call p the dimension of the inputs, and we will always, in this course, consider simple applications where the inputs can be thought of as just p-dimensional vectors. Of course, in more complicated tasks, maybe the data is not represented directly as a vector, but you need to find a way to represent it efficiently to fit your algorithms. Okay. So also, for example, in applications where you really want to process image data, in reality what you would do is not vectorize the image, because if you vectorize the image, you are losing the two-dimensional structure, and of course many interesting features are contained in the two-dimensional structure. If you vectorize it, you totally lose this information.
So in this case you have special neural networks that take as input not vectors but matrices, and that will exploit the two-dimensional structure. Or if you have images with colors, then the inputs are not matrices but tensors, which means you have three dimensions: you have the dimension of the color, which is a three-dimensional space, and then you have the dimensions of the pixels. So you can feed algorithms with tensors, there is no problem. But I will just consider a simpler setting where the inputs are just vectors. And actually you can take whatever data you want and vectorize it. You may lose information if you do so, and in certain applications you don't want to do that, but for the moment I will say that the inputs can always be thought of as vectors. Okay. Yes? Can I ask another question? Yes. So the label or output, can it also be a vector? Say it again. Here you put that the label or output is in R, but can it also be a vector? Yeah, the output can also be a vector. Thank you. Here I just want to emphasize that the output is of lower dimension than the input; I think of p as something rather large. But here I could put, if you want, a little m, R to the m, and you could have as output a small vector, no problem. But for the moment, I will just consider these as numbers. Okay. All right. Then I will construct the matrix of inputs X by concatenating all these inputs; I have n of them. Each time I write a vector, I think of it as a column vector, so here I just created a matrix whose rows are the inputs. This is a matrix of size the number of data points times p. And n I call the number of data points, or the number of samples; I will mostly use that word, samples, which is the standard vocabulary. So concretely, this matrix is what? It's a huge matrix like this, where here at the top left you have x11, and at the end of the first row you have x1p.
On the second row you have x21, x22, and so on and so forth, down to the last row, xn1 up to xnp. Okay. And each of these columns is called a feature: this column is feature 1, and this one is feature p. Is it clear? So we will always represent the inputs in this form, while the outputs form the vector y. All right. So this is the first ingredient: we have data in matrix form, this matrix X and the vector y. And so the data with this representation is just the pair of X and y, inputs and outputs. Okay. The second key ingredient is a parametrized model. Now that we have this data, we need to design a model to process this data, to learn from this data, to predict for new unseen data. I will generically call this model f. This is just a function that takes as input x, which means one input data point, so again a vector in R^p. But it will also take as input a vector of parameters, so this function is parametrized by parameters that I denote theta, which will be of size M. And it spits out an estimate, a prediction, that I will write as f(x, theta), or y hat theta of x, or simply y hat if there is no possible confusion. These are just three notations for the same thing. The hat here is a standard notation in statistics; the hat means estimated. This is your estimate: the estimate by the model of the output associated to the input x. So this function f just defines a generic rule between inputs and outputs. And the whole point is to have a rule, a class of functions, which is rich enough to represent complex relations between an input and an output, and that we are able to learn. Yeah. So the point will be to find these parameters, the model parameters. Learning will be about finding these parameters, the best parameters to predict on new unseen data. Okay.
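The two ingredients so far, the data matrix and a parametrized model, can be sketched in numpy. This is a minimal illustration with my own toy choices (small N and p, and a linear model as the example of f(x, theta)):

```python
import numpy as np

# Rows of X are the N input points, columns are the p features;
# y holds the N real-valued labels (so this is a regression setup).
rng = np.random.default_rng(1)
N, p = 5, 3
X = rng.normal(size=(N, p))   # X[i, j] is x_ij: sample i, feature j
y = rng.normal(size=N)        # one label per sample

# A parametrized model f(x, theta); here, purely as an illustration,
# a linear one with a parameter vector theta of size M = p.
def f(x, theta):
    return x @ theta

theta = np.zeros(p)
y_hat = np.array([f(X[i], theta) for i in range(N)])  # predictions y hat
print(X.shape, y.shape, y_hat.shape)
```

Changing theta changes the predictions y_hat; learning will be about choosing theta well.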
So again, let me say for the last time: if y belongs to an alphabet which is real, which means the outputs are real numbers, we talk about regression. If instead it's discrete, so the prediction y hat belongs to a certain discrete set, then we talk about classification. Okay. All right. And let me define another set, which is the set of functions f_theta; I will denote it f_theta like this to emphasize that it's a parameterized model. So this is the set of functions parameterized by theta: for each value of this vector of parameters, you have a new function. Each time you change theta, you get a new function. So this defines an infinite set of functions if the parameters are continuous. I will call this infinite set of functions our hypothesis class, in the sense that from the beginning, by deciding to use a function of this form, you are making the hypothesis that such a function is a good model for the task at hand. It will be a good predictive model if you are able to learn the parameters correctly. Right. So your final function, your final choice, will belong to your hypothesis class, which is a set of parameterized functions. So can someone give me an example of such a function f_theta, and what the parameters are in this case? Can someone give me a hypothesis class, essentially, that you know, and that I know that you know? The linear function, for example. So the hypothesis class could be the set of functions f_theta of the form theta_0 plus theta_1 x plus theta_2 x squared: the set of all polynomials of order two. It's a valid hypothesis class. Can someone give me another example? A neural network is a hypothesis class. Okay. We will write down at some point formally what a neural network is.
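The degree-two polynomial hypothesis class mentioned in the discussion can be written down directly; each choice of theta = (theta_0, theta_1, theta_2) picks out one function in the class:

```python
import numpy as np

# One member of the hypothesis class of order-two polynomials,
# selected by the parameter vector theta.
def f(x, theta):
    return theta[0] + theta[1] * x + theta[2] * x**2

# Two different parameter vectors give two different functions
# from the same hypothesis class.
theta_a = np.array([1.0, 0.0, 0.0])   # the constant function 1
theta_b = np.array([0.0, 0.0, 1.0])   # the function x^2

print(f(2.0, theta_a))  # -> 1.0
print(f(2.0, theta_b))  # -> 4.0
```

Varying theta continuously sweeps out the whole (infinite) hypothesis class.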
I don't want to write it here, but a neural network is actually just a function at the end: something that takes an input and spits out an output. And this neural network has a kind of feed-forward structure in layers, like this. And what are the parameters in this case? What parameterizes this rather rich and complicated function that we call a neural network? It is the weights and the biases. The weights that connect the neurons are the parameters of a hypothesis class of neural networks. Is it clear? Okay. And the task, once you define a hypothesis class that you think is good enough for your task, is to find which function in this hypothesis class is the best one. So you need to find which parameters are the best in your case, given the data you have access to. Okay. So let me emphasize again: once you have chosen a function in your hypothesis class, this function will be used after training, which means after finding theta hat. By the hat I mean, again, that you have chosen values for the parameters. It will be used to predict the output, the label, of new unseen data points, for x unseen during training. And the learning problem is really finding the best theta, and we will see what we mean by the best, the best set of parameters, which means the best function in your hypothesis class. Okay. Now we have a model, a hypothesis class, and we have data. So we need to learn in some way, and how to learn means how to define a procedure to find the best theta. What you need for that is a cost function, or what we also call an energy function. Why cost or energy? Because in physics, an energy function is something that you want to minimize: systems tend to minimize their energy to reach equilibrium, or their free energy, but some notion of energy. Okay.
Here the cost function is also something that we want to optimize, in particular to minimize. We want to minimize the cost, and I will write down what I mean by the cost. So the cost is some function that depends on the label and on your model applied to the associated input. When I don't write an index for x and y, it means the associated pair: y is the label, x is the input that goes with it. As we also discussed, it's a cost between the true label for a given data point and your estimate of the label according to your model, the output of your model. And this is a function that takes values in the non-negative real numbers; we will not allow this cost to be negative. Essentially, we want to make this cost as close as possible to zero, and this cost will be used to evaluate the performance, the quality, of the model on the dataset. And so learning is just optimization of the cost. What do I mean by that? Learning here, concretely, means finding the theta hat which is the minimizer, the argmin. If someone is not familiar with what the argmin is, just tell me. Okay. And this, again, I could denote similarly as before; I used upper indices for the index of the data points, so let me be coherent. Okay. So here, what is the operation I'm doing? For each pair of input and output, I'm evaluating the cost, which means some non-negative function, some notion of error, between the true output, the true label, and the predicted one. So for each pair of input and output in my dataset, I'm computing the deviation, the cost, between the true label and the one that my model predicts, and I sum all these costs over all my data. This gives me a number.
This number depends on theta, which means on the parameters of my model. And what I do is try to minimize this overall cost, this global cost, over my parameters, which means I'm trying to find in my hypothesis class the function, the model, that minimizes the overall cost. This is learning in supervised learning. Okay. Is that clear? It has to be clear; if it's not, please ask, because otherwise it will become a bit difficult to follow from now on. Should it be a summation over all the data? I ask because, if you think of it, if I have less data, then the cost will be less. Yeah, but the absolute value of this cost does not matter. You see, I can multiply this cost by 10 billion and it does not matter; you can always rescale this cost the way you want, because anyway you're taking the minimum, and the minimum does not depend on a global scaling factor. Right? Okay, I think the answer to your hesitation is that it would be exactly the same if I averaged the cost here. So it should be the average? It is the same. If you take the average or not, it's just that without the average this will be a bigger number, but still you are looking for the minimum. So you see that this one over n here, which means taking the average rather than just summing, does not change the value of the minimizer. Right? And because when we train the model, the n is fixed. The n is fixed, yes. Okay, thank you. n is maybe 10,000, so actually, yeah, maybe you don't want this cost to be too large, so you take a one over n, no problem. But it does not change the end value, the minimizer of this cost, with any global factor here: instead of one over n, I could take any scaling factor, and it does not change anything in this optimization problem. Okay. I have a question.
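The point about rescaling can be checked numerically. This is a sketch on a toy one-parameter model of my own choosing (y hat = theta times x, with data generated at theta = 3): the summed cost and the averaged cost have the same minimizer.

```python
import numpy as np

# Toy learning problem: find theta minimizing the total squared cost.
rng = np.random.default_rng(2)
x = rng.normal(size=100)
y = 3.0 * x + 0.1 * rng.normal(size=100)   # data generated with theta = 3

thetas = np.linspace(0, 5, 501)            # candidate parameters (grid search)
total_cost = np.array([np.sum((y - t * x) ** 2) for t in thetas])

# Summing or averaging the per-sample costs (any positive global factor)
# leaves the position of the minimizer unchanged.
assert np.argmin(total_cost) == np.argmin(total_cost / len(x))
print(thetas[np.argmin(total_cost)])       # close to 3
```

Here the minimum is found by brute-force grid search just for illustration; in practice one uses gradient-based methods, as discussed later.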
I remember you mentioned the word deviation; are you implying that we calculate the deviation between the predicted one and the true label? Exactly. And for almost every algorithm, we also do the same? Yes, yes. Whatever algorithm you want to think of, whether you are doing linear regression, which we will discuss at length, or much more advanced machine learning techniques such as neural networks, the problem will always be in this form, in the sense that you will define the hypothesis class, which could be linear models, which are simple, or very rich neural networks with very fancy architectures; it does not matter, this defines a hypothesis class. You have data. Then what you will do in any case is define this cost function, evaluate the cost over each pair of training data, that is, the inputs and outputs that you have access to, and try to find the parameters that minimize this cost. Okay. Then of course, one obvious difficulty is that this operation here is not innocent: finding this minimum can be extremely hard, or very computationally demanding, and in some cases it's not even clear that you can find the minimum, because this cost may be a non-convex function. It can be a very complicated function; it can depend in a very complicated way on theta, and theta may be a very high-dimensional object. So finding the minimum of this cost is a hard problem in general. Sometimes not, sometimes yes. If the cost is convex, at least you know it has a single minimum; it doesn't mean that you will find it efficiently, but at least you know the minimum is unique. If the cost is non-convex, you're not even sure that there is a single minimum, and it's not even clear how to find a good minimum, and what that even means, a good minimum. Okay. So this is not the end of the story. But learning will always take this form: you have a model, you define a cost function.
You find the parameters that minimize the cost; you try to find the parameters that minimize the cost. Okay. I also have a question, about the parameter dimension M: do we set it here, or do we also learn how to set M? How to set M, you mean how do you choose M? Yeah. That's a very good question, and at least the next two courses will be dedicated to that, to understanding that choosing M is not an obvious question at all. Because from what I understand, from what you said, it seems that in machine learning you don't care much about the parsimony of parameters. No, it depends. Maybe you want your model to be interpretable; in this case, you may want to enforce parsimony, and we will see how to do that. But not always. Okay. But here you are raising a very good point. You could tell me: okay, Jean, that's great, you just gave us what machine learning is about, but then I could always take the richest possible hypothesis class, and just try to minimize in this extremely rich, kind of universal, hypothesis class, and whatever the application, it should work, because my model is extremely rich; I just need to find the best one in this super rich class. But we will see that there is a trade-off: if you don't have enough data, it might not be a very good idea to have too rich a hypothesis class. And this is the whole issue behind machine learning: to find also a good hypothesis class given the amount of data and the quality of the data that you have access to. And we will discuss that at length. Okay. So let me say a last thing and then I'm done for today. Concretely, this cost could be what? Here is the one we will mostly restrict to in this course, though we will also talk about other costs.
For example, in regression, particularly when the labels that you have access to are continuous, the usual cost that we take is the square function. So if you take the cost between y and y hat to be the square function, then you see that this average over the data points of the cost is naturally called the mean square error, or MSE. It quantifies the average over the data points of the squared deviation between the true labels and the predicted ones. And learning would mean, in this case, minimizing the mean square error. Okay. Just a remark; at some point I will go a bit more into this remark. If you are using a mean square loss, implicitly what you are assuming, from a Bayesian point of view, from a probabilistic point of view, and we will discuss that, is that your true labels are corrupted by Gaussian noise. So the underlying assumption behind using a mean square loss is that you have some noise in your data and this noise is Gaussian. And why is that a standard choice, to use the mean square loss, even if we don't really know whether the noise is Gaussian or not? That might not be the case at all. Because, you know, in many complex phenomena, errors are the accumulation of different sources of error: the total error that is added to your data, that corrupts your data, comes from many stochastic sources of different types, which are not necessarily Gaussian. But when you sum many different random variables, many different noises if you want, what you get at the end is something Gaussian. This is the central limit theorem, right? So it's natural to take Gaussian. So there is a question: could you please explain the use of R squared in addition to the MSE? So I'm not sure what you mean by R squared in this case.
What is R squared? No, really, if you want, just ask the question out loud. Yes, R squared, but I don't understand what you mean by R squared, what the R squared error is. Okay, we'll discuss it later. Okay. Yes. So the meaning of the MSE: concretely, the MSE between, let's say, the vector of true labels and the vector of outputs of your model, which depends on the matrix of inputs, now writing things in matrix form, is 1 over n times the squared deviation between the true labels and your predictions. This is just the MSE. Concretely, this means 1 over n, sum over i from 1 to n, of (y_i minus y hat theta of x_i) squared. Okay. All right. I have a question. Yes. Why do some people use 1 over 2 in the mean square error? It doesn't change anything; this is just a convention, you don't care. Like I said, here you can put whatever constant you want; the optimization problem is unchanged. If you minimize a function or a scalar times a function, the argmin is unchanged; the position of the minimizer is unchanged. So why do people put 1 over 2? I think it's because when we derive backpropagation, which gives the update rules that allow us to update the weights in neural networks, this 1 over 2 is just convenient in the equations: it simplifies with a two that appears somewhere. You see that at some point we will really need to find a minimizer of this function. And how do we find minimizers, when we don't have any convexity, when we don't know anything about the function? We just look for extrema, which means points, and by points here I mean vectors theta, such that the gradient of this function vanishes, right? And so you see that if you take the gradient of the mean square error, you have a square, and by taking the gradient a two will pop out at some point.
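The MSE just written on the board can be checked in both forms, matrix and explicit sum, and the factor-of-two remark made concrete. This is a sketch with a toy linear model y hat = X theta of my own choosing:

```python
import numpy as np

# MSE in matrix form versus the explicit sum over data points.
rng = np.random.default_rng(3)
N, p = 50, 3
X = rng.normal(size=(N, p))
theta = rng.normal(size=p)
y = X @ theta + 0.01 * rng.normal(size=N)

y_hat = X @ theta
mse_matrix = np.sum((y - y_hat) ** 2) / N                    # (1/n) ||y - y_hat||^2
mse_sum = sum((y[i] - X[i] @ theta) ** 2 for i in range(N)) / N
assert np.isclose(mse_matrix, mse_sum)

# The 1/2 convention: the gradient of (1/2)*MSE carries no factor of 2,
# because the 2 from differentiating the square cancels against it.
grad_half_mse = -(X.T @ (y - y_hat)) / N                     # gradient of (1/2)*MSE
```

For this linear model the cancellation is visible directly: differentiating (1/(2N)) times the sum of squares gives exactly minus (1/N) X transpose times the residual, with no stray factor of 2.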
And this two will just cancel with the 1 over 2 that you put in by convenience. That's it; there is nothing deep about this 1 over 2. Okay. Thank you. You could put 1 over pi and you would get the same result. Okay. All right. So I think I'm done for today. I'm much slower than expected. I hope that the pace is not too fast nor too slow; if it is one of the two, please tell me and I will try to adjust, at least in this first course. I think it's important that we set things up clearly. So yeah, tomorrow we will continue to discuss the learning procedure, what we want to do, what we'll call the machine learning workflow: once you have these three ingredients, which are again the dataset, put in the correct form like this, a parametrized model, or equivalently a hypothesis class, and a cost function, how to put all these three ingredients together in order to learn a good predictive model. And I will explain what good means: what do we mean by a good predictive model, and how to get it. Because, as we will see, it is actually not enough to just minimize the cost, and we will try to understand why it is not enough, and what bad things can happen if you just minimize costs like this, closing your eyes and hoping for the best. Okay. So I advise you to look at the first notebook that I sent you. If some of you do not have it, I can resend it, no problem; just send me a mail. And it's very important that you actually play a bit with it first, because you will see there are many surprises going on, and I think the first notebook may help you to understand what we'll say tomorrow. And what we'll say tomorrow is absolutely crucial for the rest. Tomorrow we'll discuss why machine learning is difficult, why it's a complicated task to design a good predictive model. Okay.
Because, you see, what we're doing here is fitting data: we are trying to find a model able to fit the labels in our data. So right now we are solving a fitting problem. But this is not what we want to do. We don't want to fit data; we want to design a model which is able to predict on new data. And tomorrow we'll discuss why the fitting problem and the prediction problem are very different problems. Okay. So look at the notebook. I will not teach Python in this course, so for those who do not know Python at all, I'm sorry about that, but you can anyway learn the very basics of Jupyter notebooks and play with it, because essentially the notebooks are completely written: you don't have to do much, you just have to change one or two parameters in the experiments to see the output plots, the final results, and interpret them. But anyway, I advise all of you who don't know Python to learn about it, because there is no real point in doing machine learning without at some point coding some experiments, and nowadays all machine learning is done in Python. It will be useful for you at some point in your career, I can promise you that. All right. So tomorrow the meeting is at... I don't remember, if someone remembers? It's at 4pm. At 4pm, yes, that's correct, thanks. So tomorrow at 4pm we meet again. All right. See you, everyone. Bye. Thank you.