Okay, welcome back everyone, my name is Altele Ciorani from SKP, and it's a great pleasure to be sharing today's session, starting with a lecture by Professor Pankaj Mehta from Boston University. Pankaj has said his lecture will be interactive, so brace yourselves. Thank you so much for inviting me. Please ask me questions. Please stop me. I'm going to try to force you to do things; let's see how well it goes. So let me start with my first interaction: how many of you have written a deep neural network before? Oh good, then this lecture will be good for you guys; I was afraid it would be too basic. The goal of the first three lectures is to get you used to opening up a Python notebook and being able to write a deep learning model, so I'm going to walk you through three lectures that are about understanding it. And then in the last lecture I'm going to tell you about some new research we've been doing on a very mysterious thing about deep learning networks: they work despite having so many parameters that, naively, you would think they shouldn't work at all. We've been trying to understand that, and that's going to be the last lecture. Before I go on: I mostly don't do machine learning. I don't like machine learning, actually. But I feel that to be a good critic of machine learning, you have to understand it well. Most of what I've learned about machine learning has come from trying to understand it and then teaching it, because I think there's a lot of hype, a lot of crazy things people say about machine learning that aren't true. My hope is that by showing you its most basic, simple form, you'll be in a position to evaluate critically what it can do and what it can't do. Machine learning is one of the most exciting things going on, because it's really incredible at some things. But what's surprising is that it's really good at A, and people say it's really good at B, and A and B aren't exactly the same. By the end of the four lectures, hopefully you'll understand what I mean in less abstract terms. Mostly I work on biological physics. There are a lot of people here talking about things like non-equilibrium statistical mechanics and how cells process information, and we spend a lot of time on that. We think about ecology, collective behavior, and gene networks. If you're interested in any of those topics, come find me. I love chatting with students; I love chatting with all scientists, actually. It's one of the great pleasures of this job. In machine learning we've done a lot of work, and most of what I'm going to do today is a small section of a very long review I wrote, based on a class I've been teaching at Boston University for almost six or seven years. Every year I try not to teach it, and every year the students bully me into teaching it again, because everyone wants to know what machine learning is. The review is basically my attempt to get you to the point where, if you read it and do the notebooks, you'll be able to open up any paper you want from the machine learning literature, from ICML, from the big conferences, and understand it. And I think you could even write the code, if you work a little hard, to reproduce these things. So I'm going to give you a flavor of that.
The first three lectures are based on this: basically a crash course on selected topics from the review. In terms of physics, we've been thinking a lot about reinforcement learning and quantum dynamics, and we spend a lot of time trying to think about why and when machine learning works. So if you're interested in any of these topics, come bother me. I like talking to people. Please don't make me sit at this school for three days with no one talking to me; that's the most boring thing that can happen. You don't have to say anything intelligent. If you just want to chat, please come. Especially after COVID, one of the things I've learned is that it's nice to be in a place and talk to people, instead of being on Zoom, where every conversation has to have a purpose. That takes all the fun out of humanity, right? Everything is always with a purpose. All right. So, the structure of the talks: I thought they might be too basic, but since only one person here has actually written a deep network, the goal is going to be that we open up Python notebooks, and by, I guess, Friday morning, you'll write your first deep learning model. It's going to be very easy, but we have to get you there. Today's lecture is a little more conceptual: we're going to play, and we're going to try to understand what is different about machine learning. I very hesitantly call it artificial intelligence; I don't know, we can talk about that, but I will say "machine learning" throughout these talks. The next two lectures are going to get you to the point where you can write some code, because it turns out you can write the most basic deep learning models in six lines of Python code. But you have to understand the basic ingredients that go in, so we'll spend those two lectures on the basic ingredients, and then on what this deep learning revolution is: what has happened in the last 10 years that made a difference. The key has been the idea of fully differentiable models; that is the paradigm for making things work, and you'll understand what that means by the end of all this. The last lecture is going to be about research we're pretty excited about. We've spent years trying to understand a very simple question in machine learning: why does it seem to work, even though machine learning models have many more parameters than the number of data points you're training them on? Naively, classical statistics would tell you it shouldn't work at all. There's a lot of research on that; we found the existing research very mathematical and lacking intuition, so we redid a bunch of what are called cavity calculations, very complicated spin-glass-inspired calculations, but I'm just going to tell you the results and the intuitions that come out of them. That's the last part. All right. So now I'm going to do this bad thing where I start the first lecture. How many of you were able to open up Google Colab? Did you read my email? Did you try to open up notebook one? All right. We're going to spend most of... I have until 10:15, what, an hour and 15 minutes, right? Okay, 10:15. Good.
So, to start, let me unplug this and do this horrible thing that I'm not supposed to do, but I'm going to do it anyway, and write some stuff on the board. I'll try to go back and forth between the board and the slides. All right, let's start with some basics. Wow, these are interesting chalk holders; I like them. I was so grateful when the Koreans took over manufacturing this chalk and the price dropped by a factor of like 10. It was the most amazing day, right? The Hagoromo chalk. People claim they can tell the difference between the Japanese and the Korean chalk, but I can't tell at all. We have a box of the old ones, and sometimes I make people try to tell me which is which, and they can't. That claim is a lie. But anyway. So let's think about what's going on. Again, please stop me with questions; I like interruptions. The best thing about not giving this talk on Zoom is that people can interrupt without feeling weird. So let's start. Generally, people throw around this idea of artificial intelligence, but ask them to define what it means... Okay, I'll ask you guys. Who wants to define what artificial intelligence is? What does it mean? You're at an institute on artificial intelligence, so what does artificial intelligence mean? Anyone? All right, apparently no one in this room has an opinion. Don't worry, I'll just stand up here until people start answering. I'm good at this; I know how to make interactions interactive. Yes? There's no wrong answer, because there's no right answer, that's what I'll tell you. So who wants to tell me what they think of when they think of artificial intelligence? Okay, Antonio. [Audience:] A system which, mostly autonomously and adaptively, makes decisions. Autonomous and adaptive decision making, okay. So this is one version of artificial intelligence. There are other versions; this is definitely one of the things people say. There's also this idea that you can learn directly from your environment, which is what this adaptive decision making is. Antonio put up a fairly narrow definition, whereas most people think of artificial intelligence as some general thing that is intelligent the way a human being is, right? I'll try to argue, by the end of the talks, that this is probably misleading: most of what people call artificial intelligence is actually just automation of a new set of tasks that we could never automate before. So I want you to keep this tension in mind: there's a difference between intelligence and automation. That's the tension I'm going to walk you through as we learn these models. All right? So generally there is a subfield; people call it a subfield. I don't know when, but after I wrote this review, people decided that artificial intelligence was the big, broad goal, and that a subfield of it is called machine learning. All these words keep changing. The idea of machine learning is that you learn automatically from data: I give you some data, and from it, I learn things; I learn how to do some tasks. And machine learning can broadly be broken up into three different kinds of components.
One is supervised learning, and that's what we're going to focus on today; in these lectures, it's the only thing I'm going to focus on. The idea of supervised learning is that I give you some data, but I also give you some labels associated with the data. So I have data, and I have labels. The data can be pictures, and the label is: is it a cat, or is it not a cat? It can be: I give you a big image, and I ask you to predict which pixels correspond to a particular object. It can be a language translation model, where I give you an input phrase in, say, Korean, and I give you an output phrase in English. The important point is that there's some input and there's some output, and the goal is to learn directly from data how to predict the output from the input. This is supervised learning, and it's what we'll focus on in this lecture. But you should know that, generically, there are other kinds of learning. Can you even see this board? Okay, I'll write it up here. Another major branch, and I would argue probably the more interesting one, is unsupervised learning. In unsupervised learning, you have some data, you assume it's drawn from some very complicated data distribution, and generally you want to say something about the data and the distribution it's drawn from. So it's a very ill-defined task. Examples of unsupervised learning: you might have read in the news about all these text-to-image generative models; or you want to do dimensionality reduction, visualize stuff, make predictions about stuff. The task here is much harder, because you basically want to summarize this probability distribution in some way and draw examples from it. These DALL·E-type text-to-image models are saying: the set of images is some complicated probability distribution; I'll give you some keywords in English, and I want you to draw samples from the set of all images with those keywords. So this is unsupervised learning, much harder, because in supervised learning we have an easy way of saying how good we are: you just compare the predictions of the trained model to the real labels. Here it's much harder. And there's been a lot of progress in making this problem look like that problem, through what's called self-supervision; that's been the big deal in the last three or four years. The last thing is what Antonio and a lot of physicists who do neuroscience specialize in: reinforcement learning. In reinforcement learning, you have an agent that can interact with an environment. It takes an action in that environment, and based on the action, it learns, generally, a policy: a way to take actions in the environment so as to maximize some reward function. So again, here there's a reward function, but the important point is that this is interactive. This is a lot like supervised learning in the sense that we know what we want to maximize; in practice, half of what's hard about reinforcement learning is constructing a good reward function.
But then I want to learn what actions I should take. I have some things I can do; I have a robot, it can move left, it can move right. The actions depend on the state of the environment. And what's interesting is that the state of the environment influences which action you take, but the action itself can change the environment, and that's what makes the problem hard. It's an interactive learning setting. So basically there are these three broad fields, and of course they mix with each other, but it's worth keeping the categories in mind. And I want to emphasize that we're going to think about the very easiest problem there is: the supervised learning problem. Even here, you'll see there's a lot of subtlety in what's going on, mostly because many of the classical intuitions we have, especially in physics, are violated; they no longer hold, or we have the wrong intuitions. So the basic idea of what we're going to do is... Yes, please. [Audience:] There's also classical symbolic AI, deducing from axioms automatically; I think that's a part of it, and in the near future it will have to be integrated into the full picture. I agree. I think what Antonio is saying is that this looks a lot like that, whereas symbolic AI requires you to make abstractions, to symbolically represent things, to generalize. We'll see that the central question is what generalization means. The most interesting thing about Newton's laws, for example, is that they tell you how to work in crazy settings you would never imagine. The most interesting thing about Maxwell's equations is that you can plug in any boundary condition, any set of things you want, any dielectric constants, and make predictions in any setting. That requires a level of abstraction that it's unclear modern artificial intelligence can achieve, despite the fact that you can have a conversation with it. The biggest advocates would say we're gradually getting there, that abstraction and symbolic logic are just more and more complicated statistical learning, which is what we're doing: learning statistics from data. The skeptics like me (I've already told you I'm a skeptic of machine learning) would say no, no, that's a completely different kind of thinking. The truth is you probably do some mixture of both, but I would say it's a qualitatively different fixed point, in the language of physics. So that's the very broad, general picture. For the remaining lectures, until maybe the last 10 minutes of the fourth lecture, when we'll come back to these ideas about what's difficult, I'm just going to focus on the simplest problem. Even here we'll see there are a lot of mysteries. And again, the lectures are trying to do two things at once: to force you to think, but also to give you very practical skills. You shouldn't be scared to open up a Python notebook and write a neural network; that's what I want at the end of this. All right. Yes? [Audience:] Is reservoir computing different from these three, or is it a part of one of them? Reservoir computing is generally a form of supervised learning, depending on what you want to do, where you build a neural network with many, many features; usually it's a high-dimensional recurrent neural network, where you run a dynamical system and then learn to predict something from it.
So it's usually a form of supervised learning for time series; just a particular architecture and a particular problem. My goal is that after we do this, and maybe a little more work, you'll understand the basic logic of all this stuff. All right. Oh, wrong way, that's not what I want to do. Okay, let me erase. Oh, it's wet, not good. Is there a non-wet one? So let's get down to the simplest version of the problem. The problem of machine learning, of supervised learning, is to learn how to predict stuff from data. And predicting seems like the same thing as fitting. What I want you to take away from the next, whatever, 20 to 55 minutes is that they are not the same. Not completely different, but fundamentally different things. The basic problem is that we have a data set we want to learn from. Generically, we denote the data set by D, and we usually organize it as a matrix. The data set is a matrix X, and the matrix X has features, and these are different data points: data point one, data point two, and so on, up to however big my data set is. For each data point I measure some N_f features, and there are M data points. It's often useful, for simple notation, to collect everything into one big matrix. So I have my data set with my independent variables, which are my features, and then of course I have my dependent variables. Generally, Y is a big vector with one entry per data point; it's a vector whose size is the number of data points, but it could also be a matrix if I have to predict many dimensions. The next thing I need is a model. What we're doing is really statistical learning, so we need a model, and the model is something that takes an input feature, has some parameters, and makes a prediction for what Y is. So a model takes the input features for each data point (let me put an i here: x_i), makes a prediction, and what I get to do is change the parameters. This is a concept everyone is comfortable with: there are parameters. The last thing that's really important is a cost function. A cost function tells me how well I am doing at my fit. So generally, you have a data set, you have a model you want to fit, you have parameters you can tune, and you tune those parameters to minimize the cost function. So here's the central dilemma of all this, and this is what I really, really hope playing with these notebooks today will make clear: predicting versus fitting. Now, who wants to guess what predicting means and what fitting means? Come on, don't be shy; these aren't trick questions. Go ahead. And what's fitting? Right. I think it's already implicit: predicting is for new data, and fitting is on the data set I've already seen. So why would these things generically be different?
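Before we discuss that, here is the setup just written on the board as a minimal toy sketch in Python (my own illustration, not the notebook's code): a data matrix X with M data points and N_f features, a label vector y, a model g(x; θ) with tunable parameters, and a cost function that we minimize.

```python
import numpy as np

# Toy supervised-learning setup (illustrative only, not the notebook's code).
M, Nf = 100, 3                                   # M data points, Nf features
rng = np.random.default_rng(0)
X = rng.normal(size=(M, Nf))                     # data matrix: one row per data point
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=M)    # labels: true model plus noise

def model(X, theta):
    """The model g(x_i; theta): a prediction for every data point."""
    return X @ theta

def cost(theta, X, y):
    """Mean-squared-error cost: how badly the current parameters fit."""
    return np.mean((model(X, theta) - y) ** 2)

# Fitting = tuning the parameters to minimize the cost on the data we have.
theta_fit = np.linalg.lstsq(X, y, rcond=None)[0]
print(cost(theta_fit, X, y))
```

The last two lines are the "tune the parameters to minimize the cost function" step; here, because the model is linear and the cost is squared error, least squares does it in one shot.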
We can think about that. Let me ask a different question: when do you expect these things to be the same? All right, I don't expect you to answer that; we'll come back and discuss it. That's what the Python notebook is about. You'll find out that these things are related but not at all the same. This is the central dilemma of all machine learning. The idea that I see some data but want to predict on new data, whatever "new data" means, is the idea of generalization, and it is going to be the central dilemma of supervised learning. So in practice... can I pull this screen down now? Yes, excellent. Oh, it's hiding behind the banner. Okay, it doesn't matter. In practice, I only have the data I have seen; I don't have the new data. So what you generally do is take your data and divide it. Before you do anything to the data, no feature selection, no cleaning, nothing, you divide your data. The training data is the data I fit on, and the test data is going to stand in for the new data; somehow it's going to be a proxy for data I haven't seen yet. And the point is that, in general, what you get to minimize is the in-sample error: the cost function on the training data. But what you care about (okay, I can't fit it all on these boards; can you see the bottom of this board? Is this stable? Okay) is really the out-of-sample error, the error on data I haven't seen. My proxy for this out-of-sample error is going to be my error on the test data set. But you see, what I get to play with is the training error, and what I care about is the test error. Naively you would think they're almost the same thing, so now we're going to play around and see why they're not. All right, now I'll put the screen back down. This is just what I told you: the central dilemma is that I get to fit the training error, but what I care about is not even the test error; the test error is a proxy for what I care about, which is performance on unseen data. My best guess at how I'll behave on unseen data is to take the test data, some subset of the data I haven't done any fitting on, and make predictions on it. So we're going to play with a simple task: polynomial regression. The basic idea is that I'm going to generate data with polynomials plus noise. My training data lives in this interval, 0 to 1: random points in 0 to 1. And the thing I want to predict is this y_i. I have a stick! Good old physics institutes; it's the best thing about lecturing in physics places as opposed to biology places. There are no sticks in biology. And they don't ask you questions; it's so aggravating. It took me like five years to learn not to ask a question in the middle of a talk, because to me, that's how you respect someone: you show you're paying attention by asking a question. But in biology, that's offensive; you should just sit there quietly and play on your phone instead. Anyway, let's generate data with polynomials.
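In code, the bookkeeping on the board (divide the data first, minimize the cost on the training part, and use the test part as a proxy for unseen data) might look like this. This is a sketch using scikit-learn's train_test_split; the notebook itself draws its train and test points separately, as we'll see.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = rng.uniform(size=200)                        # all the data we have ever seen
y = 2 * x + 0.1 * rng.normal(size=200)

# Divide the data BEFORE doing anything else to it.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

# Fit (minimize the cost) on the training data only...
coeffs = np.polyfit(x_train, y_train, deg=1)
E_in = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)

# ...but what we care about is the error on the held-out data,
# our proxy for the out-of-sample error.
E_out_proxy = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(E_in, E_out_proxy)
```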
We're going to consider two different ways of generating the data: either with a linear function, or with a tenth-order polynomial. Does everyone understand the setup? That's how we generate the data sets. But the important point is that there's noise: the noise is independent for each data point, drawn from a Gaussian with some standard deviation; you'll see it in the code in a second. Then we're going to fit with different polynomials. The whole idea is that I fit the data with something else; I don't know what generated it. We fit with polynomials g_α of different order α. I can have first-order polynomials, which is a linear model. I can have second-order polynomials: a + bx + cx². I can have third-order polynomials: a + bx + cx² + dx³. And so on, all the way up to tenth-order polynomials. Everyone understand that? So we generate data with those two functions, and then we see how well we can fit slash predict the data using models with different so-called model complexities. Model complexity is a very subtle idea, but for our purposes, the more complex a model is, the more expressive it is: the more complicated a data set it can express, can fit, in theory. And naively, complexity scales with the number of parameters. Then we compare things in sample and out of sample. And I'm going to pull a little trick, because this will be important: we train on points in 0 to 1, but we'll try to predict on a bigger range, a range we haven't quite seen yet. You see there's this extra bit from 1.0 to 1.2. That's one form of generalization you can think about: predicting beyond the range of data you've seen. So to do this... How many of you have used Python before? Most of you. It's amazing; in the last five years it went from one hand to three quarters of the room. If you haven't used Python before, find a friend. Actually, in all cases I encourage you to sit next to someone and do this with someone else. Don't do it alone; it's good for you to meet people anyway. [Audience: Is the test data set always completely different from the training data set?] Yes, they're always different data points; that's the point. To generate the training data, I randomly sample x between 0 and 1, and to generate the test data, I randomly sample in the bigger range, so some test data points fall outside the training range. But the points here are independent of the points there; that's the fundamental point. Even when they're drawn from the same distribution, they're fundamentally different data points. Everyone see that?
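Here is a minimal sketch of that data-generating process (the function names are my own, and the tenth-order coefficients below are placeholders, not the notebook's):

```python
import numpy as np

rng = np.random.default_rng(42)

def f_linear(x):
    return 2 * x                                  # linear data: f(x) = 2x

def f_tenth(x):
    # Placeholder tenth-order polynomial (NOT the notebook's coefficients).
    return 2 * x - 10 * x**5 + 15 * x**10

def make_data(f, n_train, n_test, sigma):
    """Train on [0, 1]; test on the wider range [0, 1.2]."""
    x_train = rng.uniform(0.0, 1.0, size=n_train)
    x_test = rng.uniform(0.0, 1.2, size=n_test)   # some points fall outside [0, 1]
    y_train = f(x_train) + sigma * rng.normal(size=n_train)  # independent noise
    y_test = f(x_test) + sigma * rng.normal(size=n_test)
    return x_train, y_train, x_test, y_test

x_tr, y_tr, x_te, y_te = make_data(f_linear, n_train=10, n_test=20, sigma=0.0)
```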
So let's open up the notebook. Hopefully you were all able to open Google Colab; I think that's the easiest way, unless you have a local Python installation. You go to Google Colab, it's here, click on this thing. I've already logged in. If you haven't logged in... is everyone logged into Google Colab? Raise your hand if you're not, and I'll wait. Okay. Then go to GitHub here, and copy this address, which I'll put up here and which I also sent you by email: github.com/Emergent-Behaviors-in-Biology/mlreview_notebooks. One second, is everyone okay with this? I can make it bigger... I don't know how. Maybe I can do this. There you go, now it's bigger. If I hit search, you should see all the notebooks pop up. Go to notebook one, which is somewhere here: "ML is difficult," that's what it's called. I click on it, and it should load up and look like this. The basic idea, again, is the task I told you about. The important point is that we're going to ask how fitting and predicting, meaning training and test error (which we won't define formally; we'll just look at it visually), depend on the number of data points I train on, the amount of noise, and the complexity of the model I'm using. So this is just what I told you: this is how I generate the data, I fit with polynomials of different order, and we have a training set and a test set. The cost function we're going to use is the mean squared error. I have the thing I generate the data with, f(x_i), and the thing I fit with, polynomials of different orders, g_α(x). And g_α(x) can be three things. It can be a linear model, so I'm fitting a linear model. It can be all polynomials up to order three, so third-order polynomials: how many fitting parameters does that have? Four. Everyone understand? And tenth-order polynomials: how many fitting parameters? Eleven. Okay, I just want to make sure everyone understands what's going on. We generate from either the linear model or the tenth-order polynomial, the two functions I showed you before. And the way we fit is that we choose the parameters to minimize the difference between the predicted values and the generated data.
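Under the hood, those fits amount to something like the following sketch; numpy's polyfit performs exactly the least-squares minimization just described.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, sigma = 100, 0.5
x_train = rng.uniform(0.0, 1.0, size=n_train)
y_train = 2 * x_train + sigma * rng.normal(size=n_train)  # linear data plus noise

for order in (1, 3, 10):                    # 2, 4, and 11 fitting parameters
    coeffs = np.polyfit(x_train, y_train, deg=order)      # least-squares fit
    y_hat = np.polyval(coeffs, x_train)
    mse = np.mean((y_hat - y_train) ** 2)                 # in-sample error
    print(f"order {order:2d}: {order + 1:2d} parameters, train MSE = {mse:.4f}")
```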
So here is the series of tasks to go through. First: generate the data with f(x) = 2x, so just linear data. I want you to go through these examples. Sigma is the amount of noise, and N_train is the number of training data points. If I go to the code, what I'll change is the number of training points and the amount of noise, and the code to do that is just in here. The only things you really have to change are these: this sets the number of training points, N_train (just change it to whatever number), and sigma_train sets the amount of noise; it's the standard deviation of the noise. The first task says: generate this plot when N_train = 10 and sigma = 0. So I set this to 10 and this to 0, then hit Ctrl-Enter, or this button here that runs the cell, and it generates a plot like this at the bottom. You have to run it; you have to trust that I'm not going to corrupt your computer. So I do this and it generates a plot like this. The point is that if I have 10 training data points, there are 10 points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. There should be three curves, but they're sitting right on top of each other, because every model can fit this perfectly. There's an orange curve, which is the prediction from my linear model; this one is from my third-order polynomial; and this one from my tenth-order polynomial. Then, in the cell below, this sets the number of test data points: how many new points I'll make a prediction on. We don't fit with these. It will draw 20 new points, plot them, and show you the predictions for those 20 new points. So I run the test cell, and here you go: 20 new points, the blue dots on the interval 0 to 1.2, and this is what the three models predict. You see that they predict perfectly: all three models behave exactly the same if I have no noise, even with only a little training data. So here's what I want you to do. How much time do I have? I'm done at what time, 10:30? No, 10:45, right? Okay. I want you to find a partner and spend 20 minutes going through these exercises. It's that simple. Think about what you learn. I'm just going to wait here for 20 minutes while you play. Don't do it by yourself; please find a friend, ideally someone you don't know. Actually, never mind that; I'll check in again in 10 or 15 minutes and see where you are. And if you finish this first set, fitting versus predicting when the data is in the model class, please go on to the next one too: fitting versus predicting when the data is not in the model class. Just call me over if you get through the first part and want to move on to the second. [Later:] Let me ask quickly for an update. How far have people gotten? Have people gone on to the second part, or are you still on the first? Yes, no? Okay. All right, then maybe we can discuss what we see. Or, you know what, I have until 40 past; for the last five minutes, why don't you play with the second part as well? Here what we change is where we generate the data. Did you already play with this second part? All you have to do is comment things out. For those of you who haven't, I'll give you five more minutes: to generate the data from a different function, you take this line, comment out the linear case, and uncomment this one, and now you generate from a tenth-order polynomial. Okay.
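If you would rather script these exercises than edit the cell by hand, a sweep along the following lines works (the names here are mine, not the notebook's; numpy will warn that the order-10 fit is poorly conditioned with only 10 points, which is rather the point):

```python
import numpy as np

rng = np.random.default_rng(7)

def run_experiment(n_train, sigma, order):
    """Generate linear data, fit one polynomial, return train/test MSE."""
    x_tr = rng.uniform(0.0, 1.0, size=n_train)
    y_tr = 2 * x_tr + sigma * rng.normal(size=n_train)
    x_te = rng.uniform(0.0, 1.2, size=20)                 # 20 new test points
    y_te = 2 * x_te + sigma * rng.normal(size=20)
    c = np.polyfit(x_tr, y_tr, deg=order)
    e_in = np.mean((np.polyval(c, x_tr) - y_tr) ** 2)
    e_out = np.mean((np.polyval(c, x_te) - y_te) ** 2)
    return e_in, e_out

for n_train, sigma in [(10, 0.0), (10, 0.5), (100, 0.5), (1000, 0.5)]:
    for order in (1, 3, 10):
        e_in, e_out = run_experiment(n_train, sigma, order)
        print(f"N={n_train:4d}  sigma={sigma:.1f}  order={order:2d}  "
              f"E_in={e_in:.3f}  E_out={e_out:.3f}")
```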
For those of you who haven't finished, take the last five minutes, and then we'll spend the final 15 minutes discussing what we learned. All right. You can keep playing with this, but maybe we can discuss. Let me show you what you were supposed to see and make sure we all get these things. Here are the kinds of graphs you get in the noiseless case; we'll discuss what it all means in a second. In the noiseless case, this is where I generated the data with a linear model: this is the training set, this is the test set, and you see all the polynomials work great. Of course, I can also generate with the tenth-order polynomial, and you see that in the noiseless case the tenth-order polynomial can fit all the training data, but the first-order and third-order can't. Out of sample, again, the tenth-order polynomial works perfectly, and the first and third order do okay, but not great. So with no noise, naively, there doesn't seem to be any problem. But as soon as I put some noise in, this is the kind of thing I get. What's surprising: here is the linear model, and you see the tenth-order polynomial has a lot more wiggles; if this were blown up, you'd see the yellow curve, the third order, also has some wiggles on the training data. On the test data they all do okay here, but the tenth-order one does horribly, especially in the region where I haven't seen data yet. Then I can do the same thing with data generated by the tenth-order polynomial, and you see they all do okay. But what's surprising, if you look closely, is that even on the test set, the third-order polynomial actually does slightly better. So that's the kind of thing you were trying to see. And here are 1,000 training data points, with the best fits. But here's the most surprising part. Out of sample, even though I generated the data with a tenth-order polynomial, in this region that I haven't seen before, none of the polynomials do particularly well, and the tenth-order polynomial is the worst. It gives you qualitatively the wrong answer: it tells you things are going to go down when they actually go up. [In response to a question:] Yeah, sometimes it does, sometimes it doesn't; that's an important point. It depends on what's going on. But with this many data points it won't change that much; with 10, yes. So now the question is: what have we learned from all this? I have 13 minutes, so we'll spend five or six of them discussing collectively, and in the last eight minutes I'll show you three graphs that summarize all of this in something called the bias-variance tradeoff. All right. So what have we learned? Why is fitting not predicting? What's the fundamental problem? Overfitting is a word for it; that's true. There are two basic problems, just two classes: you can basically summarize why you don't make good predictions with two basic things. Why can you overfit? Oh, come on, don't mumble; one of you, be brave. [Audience:] Bad choice of cost function. So you think if we had not used squared error, it would do better? Maybe. The cost function is one thing, for sure; that's interesting. Who else has other ideas? So you think there's a version of the cost function that would make this better? Okay.
Yeah, but why does the number of data points matter? Fundamentally, why do the number of data points, or the cost function, matter? I think the cost function is part of a bigger problem; it's a subset of one of the two categories. There are two categories: this belongs to one of them, and this belongs to the other. [Audience: not sampled enough?] Okay. So yes, it's true that the training and test sets don't match; that's another answer, training and test. But even in the region where they do match, the predictions aren't really great. Training and test don't match, fine. But step back, even bigger, really big; think as abstractly as you can. What's the fundamental problem? [Audience:] There's a specific region of the features, for example the region 1 to 1.2, that was not present in the training data. That's right: the training and test data don't match; that's fundamentally what's going on. I agree with what you're saying, but let's step back further. The fundamental problem is that I want to learn a probability distribution, but I have to learn it from samples. [Audience:] There's noise in the training data. And why does that matter? Yes, it learns the noise. There's noise in the data, and the noise behaves differently in each sample. I agree with all of this. So let's lump it into two big problems. I'm glad you're thinking about it, but if we step back, it's worth lumping everything into two problems. One problem is, generally, the fact that I have to learn from data. Whenever I learn from data, there's sampling: I don't get access to the probability distribution itself, I get samples from it. It's like a survey, so there's always sampling noise. And because there's sampling noise, the training set and the test set can differ from each other. There will always be a mismatch, from the fact that I never get to see the full data distribution, except in pathological cases. So the fact that I always learn from finite amounts of data means there's sampling noise; that's one source of error. The second general source of error is that my model might not be expressive enough to capture everything: a low-order polynomial can't capture complicated relationships. In supervised learning there are basically just these two tensions. One: I have to learn from finite samples. Two: my model has to be expressive enough to capture all the relationships. And there's a trade-off between the two: more complicated models need more data to learn. We'll come back to the nature of this trade-off for deep learning models in the fourth lecture; right now I'm going to focus on classical statistics and intuition, which is still helpful to have. So you have a finite amount of data, but it's clear that if I want to learn a more complicated model, I need more data to learn it; there's a trade-off between these things. These two classes of error go under two names, and they are the two fundamental intuitions: one is bias, and the other is variance. Bias is about the expressiveness of the model: my model might not be complicated enough to capture everything that's going on. The more complicated the data distribution is, the more expressive a model I need. The other thing is variance: this comes from learning from a finite amount of data, from the fact that I sampled the data.
And of course, the fundamental thing is that there's a trade-off between these two: more complicated models need more data. If I use too complicated a model, I won't have enough data to constrain it, and there's this tension of how to minimize both. So the whole point, the lesson, is: fitting is not predicting. Complex models can lead to overfitting, because I don't have enough data to learn everything well; models that are too simple can underfit, because they can't express the relationships in the data. The final thing I want to point out, which is never emphasized enough, is that it's very difficult to generalize beyond what you've seen in the actual data. That's just true of everything, and to me it's a fundamental difficulty of statistical learning, one that you hope abstraction and general logic can get you beyond. Most physicists would agree with me, and most cognitive scientists until recently (until now, when you can get money and grants for saying you're doing artificial intelligence); most people who are not pure statistical-learning people would agree that you might need something else. So it's worth summarizing all this in three simple graphs. Imagine I fix the model complexity and increase the number of data points. What happens, thinking about polynomial regression, is that my in-sample error on the training data keeps increasing: if I had 10 points, I could fit them exactly with a tenth-order polynomial, as we saw, but as I increase the number of data points, I can't fit them all, so my training error goes up. Meanwhile, my out-of-sample error goes down, because with more data I learn better: I sample the real probability distribution better. Everyone understand? Eventually, if I have enough data points, I've sampled the probability distribution well enough that they converge to the same value: E_in equals E_out. And I can always decompose my out-of-sample error. There's a bias, which is fundamentally error due to the fact that my model can't represent everything. There's also a noise term, which I haven't drawn here, for those of you thinking about fundamental noise I can't predict. And there's variance, the part of my error due to the fact that I have a finite number of samples; as I get an infinite amount of data, the variance goes to zero. That's the idea. All right, so that's holding the model fixed and changing the number of data points. We can also fix the number of data points and make increasingly complicated models. Here you can think about what order of polynomial I'm using: more complicated models to the right. The point is that the variance goes up, because as I make the model more and more complicated, I need more and more data to sample it well, so the effect of finite sampling becomes more dramatic. But as I make the model more and more complicated, my bias goes down, because I can express more and more complicated relationships. So in classical statistics, which is what I'm showing you (and the point of the fourth lecture will be that this picture is fundamentally incomplete), what you'd like to do is find an optimum that balances bias and variance.
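In standard notation (my rendering; the formula wasn't spelled out on the board), the decomposition of the expected out-of-sample error just described is:

```latex
E_{\mathrm{out}}
  = \underbrace{\sigma^{2}}_{\text{noise}}
  + \underbrace{\bigl(f(x) - \bar{g}(x)\bigr)^{2}}_{\text{bias}^{2}}
  + \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\bigl(g_{\mathcal{D}}(x) - \bar{g}(x)\bigr)^{2}\right]}_{\text{variance}},
  \qquad
  \bar{g}(x) \equiv \mathbb{E}_{\mathcal{D}}\!\left[g_{\mathcal{D}}(x)\right],
```

where g_D is the model fit on one particular sampled data set D, the average is over data sets, and the variance term is the piece that vanishes as the amount of data grows.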
Making a more and more complicated model always makes the bias go down, but I need more and more data. So, depending on the amount of data I have, there's an optimum that balances these two sources of error, and that's what I usually call the optimal model complexity. We saw that in the notebook. There's another way of looking at it: it often turns out that with a finite amount of data, it's better to use a model that's biased. Like we saw, the tenth-order model didn't always work best, even when the data was generated by a tenth-order polynomial. The basic way to think about it is this picture. Imagine I'm in the space of abstract models, and each point here is a different data set: I give you a data set of some size, I fit the parameters, and I plot the fitted parameters, say for the tenth-order polynomial, in its parameter space. I use a different data set, I plot that fit; another data set, I plot that fit; and so on. Because this model is so complicated, every time I fit the data, I get slightly different parameters; there's going to be a spread in the parameters. That's this bullseye here, this Gaussian here. If I had an infinite amount of data, the fits would converge on the true model; but because I have a finite amount, I get this giant spread. Now consider a different, simpler model, which has more bias but is much easier to fit, so the spread is much smaller. Even with infinite data, it would never actually reach the true model. But you see that, for this amount of data, the simple model is going to work better than the complex one, because the complex model's spread is too large. That's the basic intuition (there's a small sketch after this paragraph that makes it concrete). So that's basically what's going on: there are these two sources of error, and that's what makes predicting hard. First, I need to make the model complicated enough to capture the real relationship; but how complicated I can make it depends on the amount of data I have. Balancing these two things is what's really tricky about supervised learning, and about generalization, because generalization is about predicting stuff for data I haven't seen, and you always have to balance these things. So I think that's my time for today. Tomorrow we'll move on from this general idea (keep it in mind) to more practical things: understanding how we practically fit stuff, to get us to deep learning. I think I only have one lecture tomorrow; I don't remember, but I think so. Then on Friday, we're going to write our first deep learning model. We'll use the simplest package there is for writing a neural network; it's called Keras, and if you have time, you might want to play around with it. The goal is to train a simple neural network by lecture three. Then in lecture four, we'll come back to all the concepts we talked about today, and talk about why there has been a fundamental revolution in our statistical understanding of this stuff in the last four years; I'll try to explain our best understanding of that.
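To make the bullseye picture concrete, here is a small simulation (my own sketch, not from the notebook): fit the same kind of noisy linear data many times, once with a simple model and once with a complex one, and measure how far the average fit sits from the truth (bias) and how much the fits scatter between data sets (variance).

```python
import numpy as np

rng = np.random.default_rng(3)

def f(x):
    return 2 * x                          # the true model

n_train, sigma, n_datasets = 20, 0.5, 200
x_grid = np.linspace(0.0, 1.2, 50)        # where we evaluate each fitted curve

for order in (1, 10):
    preds = []
    for _ in range(n_datasets):           # each pass = a freshly sampled data set
        x = rng.uniform(0.0, 1.0, size=n_train)
        y = f(x) + sigma * rng.normal(size=n_train)
        c = np.polyfit(x, y, deg=order)
        preds.append(np.polyval(c, x_grid))
    preds = np.array(preds)
    bias = np.abs(preds.mean(axis=0) - f(x_grid)).mean()  # average fit vs. truth
    spread = preds.std(axis=0).mean()                     # scatter between data sets
    print(f"order {order:2d}: mean |bias| = {bias:.3f}, mean spread = {spread:.3f}")
```

The order-1 fit has small spread; the order-10 fit has tiny bias on the training interval but a huge spread, especially beyond x = 1, which is exactly the wide bullseye in the picture.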
Lecture four is going to be at a much higher level; it's much more of a research talk, one I'm going around giving. So if you can't follow everything in lecture four, hopefully the first three lectures will be useful for you. All right, sorry, I went over. I think it's break time.