Thank you very much. Okay, hi Pratik. Antonio is not in today, so I'm kind of stepping in, chairing. I think there may be some stragglers coming in, but I guess we start now. And you may see we have the automatic captioning system on; it was requested before. Good. Second lecture, deep learning. Thank you very much.

Thank you. Good morning, everyone. Good afternoon, everyone. What I want to do today is the following. We have had a very, very brisk introduction to deep learning. We'll finish that brisk introduction in the first few minutes, and then I will give you a flavor of the ideas that we have been working on over the last year or so in this long-winded attempt at understanding deep learning. So this will be more of a research talk than a lecture; maybe the first 15 minutes will be a lecture.

Before we get to this, let me do a quick recap of the previous lecture. In the last lecture, we talked about neural architectures. We began with what are called fully connected layers. These are networks where the matrix of weights that creates the features at every layer is a dense matrix; there is no specific structure to this matrix, and you stack a few of these up to predict your labels. The issue we raised about these networks was that they tend to be rather big, and it depends on the kind of data. So if you have low dimensional data... I think we can't see what you're writing. Oh, oops, sorry. You're not sharing your screen yet. Yes, yes, I forgot. Sorry. Thank you. Now you should be seeing it.

So the issue with fully connected networks is that if you have data that is high dimensional, then the networks will be required to have lots and lots of parameters, so they are rarely the right thing to use there. If you have data that is low dimensional, let's say 1,500-dimensional data, then you can process it pretty well using these models, and this is also how people use them in practice. But for images, the amount of signal is much, much smaller than the number of pixels; there are lots and lots of pixels in images. A number that I like to quote pretty often: the number of pixels in a typical image taken by a smartphone camera today is 50 megapixels. This is a ludicrously large number. If you even think of running a one-layer fully connected network that predicts 1,000 classes, a perceptron for this data, it will by itself have 50 billion weights: 50 million times 1,000. And that tells you that such dense networks are no good when handling highly redundant data like images or text or sound.

Convolutions, the way I introduced them, are a way of reducing the size of the network, built on the understanding that images have local structure: things have similar patterns locally, but at the global scale images look slightly different. To capture such data, convolutions are well suited because they have a receptive field, and you only do computations within the receptive field of the convolutional kernel. If this is your signal and this is your kernel, then this is the receptive field of the kernel, and those are the only places where you're summing up the signal x.
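To make the receptive field concrete, here is a minimal numpy sketch of a one-dimensional convolution; the signal and kernel are made up for illustration, not taken from the lecture:

```python
import numpy as np

def conv1d_valid(x, w):
    """Slide the kernel w over the signal x. Each output entry depends
    only on the len(w) inputs inside the kernel's receptive field.
    (As usual in deep learning, this is really cross-correlation.)"""
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

x = np.array([0., 1., 2., 1., 0., 0., 1.])  # toy signal
w = np.array([1., 2., 1.])                  # toy kernel
print(conv1d_valid(x, w))                   # [4. 6. 4. 1. 1.]
```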
We talked about how to stack up layers of convolutions: just like the perceptron is the building block of a neural network, a convolutional layer is the building block of a convolutional neural network, and the basic recipe doesn't change. You have a convolutional layer, and after that you apply the nonlinearity; instead of w transpose x, you have some w convolved with x. And we saw how you can take multiple channels of x and convolve them independently, and so on and so forth.

Convolutions are nice because they allow us to build translational equivariance in images. If the cow in this image were at a slightly different location, let's say here, the convolutional filter would still give you a large value, just at this location instead of at that location. The features are transformed in the same way that the inputs are transformed. This is nice because you can detect objects in different parts of images, you can detect words in different parts of sentences, you can detect sounds in different parts of long sequences of sounds. But it is not so useful if you want to do something with this detection. In order to call this cow a cow, I need one number, which is my output y hat, and I need to make sure that y hat doesn't change even if the cow moves around in this image. That requires y hat to be invariant to the location of the cow, to translations of the cow. So while convolutions allow us to handle things that move in the image plane, or at least notice that they move in the image plane, we don't get to use them directly, because we are simply multiplying all these features by this matrix or vector V, and depending on what V is, some of this will be invariant and some of this will not be.

Max pooling is an operation much like convolution. It also works in a small window; it simply takes the largest value of a pixel within that window and writes that as the output. Max pooling is a useful layer because it deliberately destroys a lot of information: if you do a two-by-two max pooling, the size of the image goes down by a factor of four, and this is how people sometimes process very large images. Of course, it also destroys a lot of information in the bad sense. If you're taking local maxima in your image, then you are necessarily removing a very large amount of small-scale structure. In some cases this might be reasonable; for identifying simple things like this, it might be reasonable. But if you want to read the number on this vehicle, for instance, it is not a good idea.

So typical convolutional networks are built using a combination of convolutional layers and pooling layers with different strides, plus nonlinearities, as in the sketch below. You train them with stochastic gradient descent after calculating the derivatives with backpropagation. Alongside, you perform augmentation to handle nuisances that are not as easy as translations. This simply involves taking an image and creating copies of it in your training dataset with the same labels, so that when you feed those copies to the network, it is forced to classify objects like this as also, let's say, a car or a house. These kinds of objects are unusual ones, so it is wise not to augment data with those.
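Here is a hedged sketch of such a stack in PyTorch; the layer sizes are illustrative choices for 32 by 32 RGB inputs and ten classes, not the specific architecture from the lecture:

```python
import torch.nn as nn

# Convolution -> nonlinearity -> 2x2 max pooling, repeated, then a small
# fully connected head that maps the remaining features to class scores.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # 3 input channels -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # image shrinks by a factor of 4
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(32 * 8 * 8, 10),                   # assumes 32x32 inputs, 10 classes
)
# Augmentation acts on the data, not the model: transforms such as
# torchvision.transforms.RandomHorizontalFlip create label-preserving
# copies of each training image.
```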
Okay. Any questions before we begin? Okay. Yeah. So now let us talk about how to... okay, there's a question. Go for it.

Hello. Yes. I think yesterday you mentioned that a few of the features that are inferred in the first layer are quite similar. So can we start with that layer being fixed, in the sense that we don't actually have to vary that layer? If those features are quite universal, can we just use them?

Yes. Yes. This is a very nice question, and many people have attempted to do this. It tends to work quite well in some cases. What you're talking about is: if I know that most of the world looks like this at the small scale, why should I bother learning these features? That makes total sense. Now, there are two ways to think about this, so I'll give you one example.

There is a subarea called transfer learning where people are interested in precisely playing this game. They will build a model on one dataset; let us call it the source dataset. Let's say this is a dataset of animals that you find in nature: cats, dogs, giraffes, elephants, etc. And you would like to adapt this model to some new dataset; let us say this is vehicles. Now, from the outside there is not that much in common between different vehicles and different animals, but you know that the low-level features are presumably the same. So the way people do transfer learning is that after training the model on the source, they do not train the lower layers again; they only train the last few layers on the images of the destination dataset, which is the vehicles task. In this way, the fewer the parameters you train, the fewer the samples you need, and that is why you can get away with slightly fewer samples of vehicles than if you trained from scratch. You can think of it as a great initialization for the weights of a network: instead of randomly initializing the weights, you initialize them using something that was trained on some other, presumably related, dataset.

Now, the efficacy of this depends on how different the two datasets are. For instance, if you have RGB images on the left-hand side, which is how these features were created, and MRI images on the right-hand side, which are grayscale images, and 3D images to boot, you may not get very good transfer learning. That simply says that the low-level features of MRI images and RGB images are quite different. Obviously, there are other situations where both of these are RGB datasets and they are related tasks. Incidentally, I work a lot on understanding such distances between tasks, and then we can say: look, these are the situations in which you can safely do the transfer; these are the situations in which you shouldn't, or in which you should do something more clever. All of these are techniques to utilize this shared information.
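In code, this kind of transfer is often only a few lines. A hedged sketch, where the ImageNet-pretrained backbone and the five-class vehicle head are illustrative choices, not the lecture's setup:

```python
import torch.nn as nn
from torchvision import models

# Reuse low-level features trained on a source dataset; train only a new head.
net = models.resnet18(weights="IMAGENET1K_V1")   # weights from the source task
for p in net.parameters():
    p.requires_grad = False                      # freeze the lower layers
net.fc = nn.Linear(net.fc.in_features, 5)        # new last layer, e.g. 5 vehicle classes
# Only net.fc.parameters() are given to the optimizer; everything else is
# kept as the "great initialization" inherited from the source dataset.
```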
People have also tried to set low-level features analytically. Even when you train on one dataset, let's say D, you could try to somehow find these low-level features analytically. There are what are called Gabor filters. Gabor filters were first noticed in the V1 processing region of the brain, the early visual cortex, and they look exactly like this: they are essentially ellipses with different frequencies and different angles. People have tried to set the initial few layers, say the first two layers, directly using the analytical formula of the Gabor filters. This doesn't quite work that well, because the number of features you can create this way is really humongous, even more than the number of features that typical convolutional networks have. You would be required to create a very, very fat first layer, and that would basically defeat the entire point of having these features pre-learned. If you fit them on the training dataset, then you may get away with a smaller first layer; so this works in some cases. But if you are a little more strict about it and say, I don't want any dataset, I want to analytically find some of these features because I know how they look, people have not had too much success building that. You can check a paper on what are called scattering transform networks; it is by a professor called Stéphane Mallat at ENS in Paris, and he has tried a lot to build models that are analytically pre-trained.

So let us look at one very simple way of training the network. We said that if there are M classes, let's say M is 10 for 10 digits, the output y that you are trying to find is one out of these 10 classes. We also said that the last layer of the network, V, could be a matrix acting on the features: if the features live in P-dimensional Euclidean space, then V could be a matrix of size P cross M. Now, the way people fit typical classification machines is as follows. The label of every image is one number between 1 and M. They will first convert this label into something called a one-hot encoding, which is simply the y-th row of the identity matrix. Let's take digits; these are digits from 0 to 9. The one-hot of y equal to 2 will be 0, 0, 1, 0, and so on: it has a 1 at location 2 and 0s everywhere else. Similarly, y equal to 4 will have a 1 at location 4 and 0s everywhere else. This is just a construction, nothing very important about it.
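The construction is one line of numpy; a minimal sketch:

```python
import numpy as np

def one_hot(y, num_classes=10):
    """The y-th row of the identity matrix: a probability distribution
    that puts all of its mass on the annotator's label."""
    return np.eye(num_classes)[y]

print(one_hot(2))  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(one_hot(4))  # [0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
```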
The one thing to notice is that this is a probability distribution: it sums up to 1. So in some sense the one-hot label is the true likelihood that the annotator assigns to this image being a 2, or to that image being a 4. And the loss for the network will simply check how far away from this probability distribution you are. We interpret the output of the model as a probability distribution; this is simply an interpretation, one that lets us use the output more effectively. If y hat is a vector in 10 dimensions, I would like to interpret the k-th entry of y hat as the probability of this image being the digit k. What we're going to do next is a construction that allows us to interpret the vector like this. Just as when you do Bayesian inference or maximum likelihood estimation, you take a parametric model and cook up a probabilistic model out of it: you say, this is my probabilistic model of the data, and then you fit that model. Similarly, we are going to say, this is my probabilistic model of the data, and that is why the output of this model is my desired probability distribution. I'm going to fit the parameters w to make sure that this is as close as possible to my true probability distribution, which is the annotator's distribution in my dataset.

Before we get there, I will take the opportunity to tell you about the logistic loss. You have perhaps seen this before: the logistic loss is used in logistic regression. It is simply a binary classifier, and you can think of the prediction as a two-dimensional variable: the probability of there being a cat in this image, and the probability of there not being a cat in this image. You know that these are two mutually exclusive events, so they add up to one, and so you don't really need to maintain two variables; you can maintain only one, and the other probability is simply one minus this variable.

People build a model for this in a peculiar way. Let's say it is a linear model, linear logistic regression. The output of this model, I choose to think of as the logarithm of the ratio of the probability of class one given x to the probability of class zero given x. The numerator and denominator sum to one; the ratio is any number that lies between zero and infinity, and the logarithm of the ratio is any number that lies between minus infinity and infinity. So no matter what output the network predicts, I can cook up my probabilities by simply inverting this formula. You will see that the probability of class one given x can be written down as one over one plus e raised to minus y hat, which is the sigmoid function.

Now, given a probabilistic model of the data like this, you can say: I have my dataset x1 to xn; what is the probability of the model producing the right labels for all the images? This is a joint likelihood. The images in the training dataset were promised to be independently drawn from the same probability distribution, so this joint distribution factors over my input images, and this is the probability for each one of them. For image number i, the probability of predicting a one gets exponentiated by yi. So if yi is one, then one minus yi is zero, the second term plays no role, and all you have is the probability of predicting a one. If yi is zero, which is the other class, then the first term plays no role, and all you have is the probability of predicting a zero. So this is just a funny way of writing down the probability of the true label. The reason this is nice is that when you take the logarithm, you get an expression of the kind yi times the logarithm of the probability of one given xi, plus one minus yi times the logarithm of the probability of zero given xi. This is what is called the binary cross-entropy loss, and it is what you use when you train logistic regression.
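Written as code, the whole construction is a few lines; a numpy sketch, where y_hat stands for the model's log-odds output:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, y_hat):
    """y is 0 or 1; y_hat is the log of p(1|x)/p(0|x), so that
    p(1|x) = 1 / (1 + e^{-y_hat}), the sigmoid of y_hat."""
    p1 = sigmoid(y_hat)
    return -(y * np.log(p1) + (1 - y) * np.log(1 - p1))

# The negative log-likelihood of the whole i.i.d. dataset is the sum
# of this loss over the pairs (x_i, y_i).
print(binary_cross_entropy(1, 2.0))  # small: the model leans towards class 1
```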
We are going to do a multi-dimensional version of the exact same loss. In the case of a neural network there might be an arbitrarily large number of classes, not just two, and we would like to write down expressions of this kind for our model. In some sense, we can think of writing the loss like this: it is the one-hot vector of the true label y, the k-th element of that one-hot vector, times the logarithm of the probability of k given x. Now, the issue is that for logistic regression we could say the network predicted one number, and that is why I chose to interpret that number as the log-ratio of the probabilities of the two classes. If we have 10 classes, you don't really know what this ratio should be; there is no nice way of creating such a ratio. So even if we know what the summands of this expression should be, we don't know how to cook up this part from the output of the network y hat. y hat is just an arbitrary Euclidean vector, in 10 dimensions if you have 10 classes, and we would now like to interpret it, to write down the probabilistic model that corresponds to a classification problem being solved by a deep network.

That is where people use the idea called softmax. So let us construct a softmax distribution. We know that whatever probability distribution we create has to sum up to 1. But of course, if I simply say that the output of my model is the probability of k given x, then the output may not sum up to 1. More importantly, the output may not even be positive; a probability is supposed to be a positive number, but the y hat that the network predicts need not always be positive, because it is just a bunch of weights multiplying some features. So we can play the same trick again. We can say that the output is not a probability by itself; the logarithm of the probability is what I think the output is. So I will say that log p is proportional to y hat, and if you solve for this, you will see that the probability of k given x, under this interpretation, is e raised to y hat k, divided by the sum over all k prime of e raised to y hat k prime. This is what is called a softmax distribution.

And I will draw a couple of pictures. Let us say that this is 0 and this is 9. If this is y hat, y hat could be anything; some entries are positive, some are negative, etc. Take e raised to y hat k divided by t, and then normalize by the sum over k prime of e raised to y hat k prime divided by t. This is a distribution such that if t is very large, then it doesn't really matter what each of the terms is; they are all quite equal across the different k's, so the probability distribution that you cook up will be quite uniform. If t is very close to zero, then the largest logit dominates, and all the entries that are even a tiny bit smaller in their y hat k get very small probabilities: you will have a huge probability over here, a slightly smaller probability over here, an even smaller one here, and an even smaller one here. So this is what happens when t tends to zero, and this is what happens when t tends to infinity. The typical value of t that people choose is simply 1. They take the vector y hat and cook up this particular probability distribution from it, and now you are allowed to think of the network as creating a probability distribution, this probability distribution. This operation, as I said, is called a softmax, which is a little bit of an incorrect name in the sense that it never quite takes a maximum; it exponentiates and then normalizes. You are free to call it exponentiate-and-normalize, but no one will understand you, so you might as well call it softmax at this point.
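Here is an exponentiate-and-normalize sketch with the temperature t made explicit, so you can check the two limiting behaviors; the logits are made up:

```python
import numpy as np

def softmax(y_hat, t=1.0):
    """Exponentiate and normalize. Subtracting the max is the standard
    numerical-stability trick and does not change the result."""
    z = (y_hat - np.max(y_hat)) / t
    e = np.exp(z)
    return e / np.sum(e)

y_hat = np.array([2.0, 0.5, -1.0])
print(softmax(y_hat))           # the typical choice t = 1
print(softmax(y_hat, t=0.01))   # t -> 0: almost all mass on the largest entry
print(softmax(y_hat, t=100.0))  # t -> infinity: almost uniform
```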
Okay. Given a probabilistic model like this, now we can simply write down the loss. As you can see, this loss is a function of the weights w: this probability distribution is a function of y hat. So if you take the derivative of the entire thing with respect to w, which is what you do in backpropagation, you will get terms like d y hat by dw, and backpropagation will calculate those for us. Yeah, good. So these are, I think, all the ingredients of typical classification models: different types of layers, and different types of losses, one for regression and one for classification. There are many, many more bits of jargon and bells and whistles that people have invented on top of neural networks, but instead of simply giving you some names, which would all look very pedantic at some level, what I wanted to do next is give you an appreciation for why deep learning is hard, and why or when we cannot do deep learning very well.

Some of you may have heard of something called the bias-variance trade-off. The bias-variance trade-off is the following object. Let us think of regression. In regression, we would like to minimize the error between your labels yi and the predictions f(xi), where f is my model. This is the loss that we minimize: the mean squared error loss of regression. Of course, as I said in the first lecture, we don't really care about this loss; this is simply the loss on the training set. What we do care about is the loss on the test set: give me all possible x's and y's drawn from nature's distribution P, and I will check how well my model predicts the true y coming from nature. If this number is large, then my model doesn't generalize very well; if this number is small, then I'm happy, because I match nature's y pretty well. So this is the quantity that we would like to minimize, while the only quantity that we can minimize is the one on the training dataset. This is not unjustified: you can see that if n goes to infinity, the training loss converges to the test loss, simply by the strong law of large numbers. So it is not as if we are doing something absurd by demanding a small value of capital R given our ability to only check R hat. But of course, with few samples you may be finding functions f that overfit your training data and do not have a small value of R(f). This is precisely what we called overfitting when we did polynomial regression.

The bias-variance trade-off is a thought experiment, more or less, to understand the discrepancy between R hat and R, and the way people write it is as follows. You can write down the population risk, which is the quantity we do want to minimize, as the combination of three different terms. Let us look at this particular one first; it is called the Bayes error. It is defined like this: it is the expected value, over (x, y) drawn from nature's distribution, of the squared difference between the y that nature attributes as the label to this data and f star, where f star is the best model that anyone could create. It is not about us anymore; f star is the best model of the labels. So in some sense the Bayes error is a quantity that characterizes how random the y that nature creates is. If the labels in your dataset were obtained by a perfectly deterministic mechanism between x and y, then the Bayes error would be zero, because the best predictor f star would exactly equal y; if y were a deterministic function of x, that is how f star is defined. You don't know it yet here, but trust me when I say that. If nature secretly added some noise before it gave you labels, then this would be a nonzero number. Noise in machine learning comes from different places. It could come from an actual physical experiment that you perform in a lab, where your sensor is noisy and all you can ever do is record the output of the sensor. Or it could come from people annotating data: different people have different opinions, they will sometimes disagree with each other, and when you write down the majority answer, it may be incorrect with some small probability. Such things constitute the Bayes error.
The Bayes error is the irreducible error for our machine learning model. If nature's labels themselves are non-deterministic, then there is no way we can fit them perfectly. So this is the irreducible error, and we shouldn't worry about it. Typically this error is quite small for a lot of datasets, but if you are solving a new problem that people have not solved yet and you collected your own data, you can get into situations where the Bayes error itself is quite large.

The first two terms look like this. f(x; D) I will denote as the model f that was learned using a training dataset D of n samples. Bias is the gap between the average predictions of the models learned on different datasets and the best predictions f star. Remember, f star is something that we never get to see or use; it is the best model you could ever have for the outputs. So in some sense, this is the one you want, and this is the one that you have, that you can calculate at the end of your training. These are the predictions that you get if you train many, many times on many, many different datasets. On average, this is how far away you are from f star: that is the bias, and it has a very natural meaning for the word bias. We would like to be as close to f star as possible; this is how far you are on average. The variance is a term that doesn't care about f star at all. It simply says how different the models are when they are trained on different datasets; it is simply the variance of the quantity f(x; D). If I gave you a slightly different dataset, how different would your predictions on a test datum be? That is what the variance characterizes.

So, in a sense, if the center of the bull's-eye is f star, the model that you're trying to find for the data: if you have low bias, you're in the vicinity of f star; if you have low variance, the cloud of points is also quite small. High variance corresponds to a slightly larger cloud, and so on and so forth; this is what large bias looks like. Of course, we want a situation that looks like this, because the population risk is the sum of the square of the bias, plus the variance, plus the Bayes error.
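Written out, in the standard form for the squared loss, with f(x; D) the model trained on dataset D and f star the best possible predictor, the decomposition reads:

```latex
R(f) \;=\; \underbrace{\mathbb{E}_{x}\Big[\big(\mathbb{E}_{D}[f(x;D)] - f^{*}(x)\big)^{2}\Big]}_{\text{bias}^{2}}
\;+\; \underbrace{\mathbb{E}_{x}\big[\operatorname{Var}_{D} f(x;D)\big]}_{\text{variance}}
\;+\; \underbrace{\mathbb{E}_{(x,y)\sim P}\big[\big(y - f^{*}(x)\big)^{2}\big]}_{\text{Bayes error}}
```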
The name of the game in deep learning, or in all of machine learning, is to make sure that you have a model that is big enough to capture all the complicated ways the labels depend on the inputs, so that it has a small bias, but not so big that finding the model becomes difficult; finding the model would then require lots and lots of data. If I give you small amounts of data and force you to fit a very large model, you will of course be able to fit it, but the number of models that fit this small amount of data is also very large, and this is why you will get a large variance when you fit such models. So if I plot the size of the model on the x-axis, and this is a slightly difficult thing to pin down at this point, but let us simply think of it as the number of weights of your neural network, or of any model to be precise; the y-axis is simply R(f), which you can estimate using the validation dataset. Bias is a quantity that goes down as the model grows richer: with more and more weights, you cannot be very far from the true hypothesis f star, the true model f star.

And so you should expect bias to go down as you give me more and more weights, because the model is getting richer and richer, and you will be close to f star. So you move along this particular arm as you increase the number of weights. But at the same time, the variance blows up, because as the model grows larger, there is a larger and larger set that you're trying to search over. Every point in this set could be a solution: depending on the dataset that I give you, you may find this solution, and if I give you a slightly different dataset, you may find a slightly different solution for f(x; D). That forces you onto this other arm. The sum of the two is minimized at some intermediate point, the bottom of this U-curve. This is why you want to choose models that are neither too big nor too small: not too small, because the bias should be relatively small, and you don't even have the expressive power to fit your data if the model is too small; and not too big, because with this amount of data we don't know, out of all the possible models that fit our training set, which one of them is the one that minimizes R(f). So typically when you do machine learning, you will increase the size of the model, or adjust some regularization parameter, find the bottom of this U-curve, and say: this is the model that I would like to use for doing inference at test time.

Now, the issue with deep networks is that they don't quite seem to conform to this very classical understanding. The bias-variance trade-off is nothing very complicated; it has been known for many, many years, of course. But neural networks do not have just a U. People noticed in some large models that the error on the test dataset, which estimates R(f), increases as you add more and more parameters to your model, with, again, let's say, the number of weights on the x-axis, but then it goes down again and keeps going down, seemingly forever. This was a very surprising finding, and people have by now also studied why it occurs, how we should understand it, et cetera. It is called the double descent phenomenon. It suggests that large models, in particular large neural networks, somehow may not suffer from variance. They have a small bias, and everyone appreciates why, but it is very surprising that you do not suffer from large variance when you fit a large neural network. And this is why people will say: the larger the model, the better the error, because this curve seemingly keeps going down forever. What we're going to do next is try to understand this phenomenon. Any questions before we begin? Seems everyone is happy on this side. Ah, cool. Thank you. And now I will do some slides. Okay, so you can see my slides, right? Okay, cool. Great.

So we have been trying to understand why neural networks do not overfit: why does the error keep going down as you add more and more parameters? I'll give you some perspective on the tools that we use to study this problem, and this is the beginning of the talk. For people like me who are interested in understanding some kind of principles that govern learning systems, whether artificial or biological, there are two very big questions in the context of artificial networks. We have said many times over that these are very large models; typical networks will have millions of weights, and these days people train even larger models with billions of weights.
So it's a very high dimensional space in which you are trying to find one of the solutions of this model. The datasets that we use to fit neural networks are also extremely large: they'll have millions of images, billions of words collected from all over the internet. And the function that we fit is a non-convex function of the weights and the inputs. Non-convex optimization in high dimensions is very hard, and really, we shouldn't be able to train a network if all three of these things are true. So something is seemingly easy in the problem that we are writing down, and it's not quite clear what. This is the question that I spent a lot of my PhD trying to understand: why is it that we can train efficiently when all the mathematics says you shouldn't be allowed to? The rough answer of that thesis is that the function is non-convex, but it is a very benign kind of non-convexity. It behaves essentially like a convex function, even though it is clearly not convex. Every time you initialize the model, there is a solution very close to you, and you find such a solution; there are many, many solutions to find, so it is not very hard to find one of them.

The second question, which we'll talk about now, is why deep networks generalize so well. You know that if you fit more degrees of freedom than the number of samples, then you can overfit. Somehow, as the double descent phenomenon suggests, and unlike what the bias-variance trade-off predicts, the accuracy of the predictions of neural networks does not go down as you increase the number of weights; they keep performing better and better. This doesn't quite make sense, because it goes against everything that we know from statistics. So something must be very special in what we are doing that makes these networks work so well. Okay, let us ignore this part.

What I'm going to tell you next is some new ideas, and here is how I'll introduce them. These two questions have been at the forefront of theoretical research in deep learning for a good seven, eight years now; it is not a new thing. Of course, people realized that these models do not conform to all the established theory; that is the first thing you would realize. And then people invented lots of ways of thinking about this. Some people will say that these networks are very, very large; they always tend to have more parameters than the number of samples, so overparameterization is probably what allows them to work so well. But you can do this experiment on your laptop: you can train a small neural network, and it will still work very well. So while overparameterization may be one piece of the puzzle, it cannot be the complete answer, because small networks also work well. And in some very, very naive sense, if I have parameters that are not constrained by the data, then why the heck do those parameters play a role in making predictions on the data? This is obviously a very naive argument, but at some level it makes sense, and I'll tell you precisely what sense in the remainder of this talk. We cannot think of overparameterization as the only answer. Some other people will say that the way we train these networks is stochastic gradient descent, which is a rather special algorithm. It is a simple algorithm, but it is particularly good for machine learning; if you see the last lecture of the LaTeX notes, you will see why. And while other optimization methods do not work well for deep learning, SGD does.
And so there must be something very special in how SGD behaves in a neural network. And I was exactly this person until a couple of years ago. I have changed now, but then, and the reason for this is that there is many different ways of training neural networks that are based on SGD, that are minor tweaks of SGD, thousands of different ways. And they all seem to kind of work. So it cannot be that just the training procedure is something mysterious. And then just because you have many batches and stochastic gradient descent as the training method, you magically get good generalization. And we could keep going down the room and then give names to many, all of these people. What I want to convince you about is that the typical data that we use to do machine learning in the modern era is rather special. And that is the elephant in the room. If you understand how the data is structured, then essentially all these other findings become explainable with that one idea. So this is a paper. You can read it. It is also an archive Rubin Yang who is a PhD student in my group worked on this paper. And then Jialin Mao held on this. It was presented at this conference called ICML in summer this year. Let us begin with this experiment. So focus on the blue line for a second. So what I did is these are images. I took a dataset of images. It is a dataset called CIFAR. It has 10 different objects, 10 different categories, stars, cats, dogs, planes, horses, giraffes, frogs, buses, trucks, planes, maybe that's it. And each image is a 32 plus 32 image RGB. So that gives you 32 times 32 times 3, which is 3072 different dimensions to the input data. And that is the number here, these are 72. So I strung up every image as a big vector and created a matrix X. The columns of this matrix are different images. There is 50,000 images in this dataset. So 50,000 columns. And the rows of this matrix are all the dimensions of my input. So 3072 dimensions. This is a matrix X of size 3072 times 50,000. The blue lines are the eigenvalues of X, X transpose. So if you have seen principle component analysis, the blue lines, the blue line is exactly the eigenvalues of principle component analysis. There is 3072 eigenvalues because that is the rank of the matrix or that is the upper bound for the rank of the matrix. And what you will notice immediately is that the eigenvalues of the data correlation matrix or the input correlation matrix, they drop very, very quickly. It is often believed in computer vision in particular, but essentially all machine learning that machine learning is possible or reasonable when input data lives on some low dimensional manifold. Even if images that we click have many more, many pixels, 50 million pixels, the signal that gives these images, the structure of the physical world may be an inherently low dimensional quantity. And that is why we can make inference from such large amounts of data accurately on the physical scene. So it is often believed that input data is low dimensional. It is not exactly low dimensional. The blue line is a full rank up to 3072, but it is effectively low dimensional because the eigenvalues drop very quickly. So the variations in the dataset primarily come from few dimensions, let's say about 150 or so in this case. And then there is a long tail of tiny, tiny variations that come from all these different eigenvectors. Okay. Until now, this is just a statement about input data sets. There is nothing very special or there is no learning so far. 
What we showed in this paper was that when the input data looks like this, then essentially any function that you learn inside a neural network also has a similar shape. Let us first look at the Hessian. The Hessian is the matrix that is the second term in the Taylor expansion of the loss; here we use the training loss that checks how well the labels are predicted by the softmax output, the cross-entropy loss. You can take a trained network and calculate the second derivative of the loss. This is a huge matrix; its size is number of weights times number of weights. These are the first 2,000 or so eigenvalues of the Hessian; it is very expensive to calculate, so we only did this many. The network has three or four million weights, so the Hessian will have whatever rank it does; in this case it will have a rank of 50,000. But we did the first 2,000 or so eigenvalues. They also drop very quickly, like the data, and then they have this shallower decay.

What does this tell you? If the matrix that defines the curvature has a few large eigenvalues and many small eigenvalues, then the eigenvectors corresponding to the small eigenvalues are the directions in the weight space in which I can move and not change the loss too much; it is the curvature of the loss, after all. So what such a decay pattern of the eigenvalues suggests is that there are many directions to move in. Think about this: these are the first 150 or so eigenvalues, and I said the network has three million weights. So after the first 150 or so, the remaining roughly three million eigenvalues are smaller than the largest eigenvalue by about four orders of magnitude. This is a humongously underconstrained system, where the loss function looks locally flat no matter which direction you look in, and these are the few directions in which the loss function curves up locally. Okay. So that is the Hessian of a trained neural network.

The Fisher information matrix is a very similar quantity to the Hessian. The Hessian is the curvature of the loss of the model, averaged over all the samples; the Fisher information matrix is the curvature of the output, y hat if you will, if you're doing regression, again averaged over all the samples in the dataset. This one is a tiny bit easier to calculate, because you can write it as a sum of a bunch of rank-one terms. For this one we did, let's say, the first 3,000 eigenvalues, and you will notice that it also drops in the beginning and then decays with a very similar slope to the data matrix. Now, the cute thing about this is that the Fisher information matrix for linear regression is exactly equal to the data correlation matrix; this is a very simple derivation that you can do. And this is very interesting, because the orange line, the green line, and the blue line would be identical if the model were linear. This is a large neural network, so it is not linear; it is a convolutional network, not a linear model. But what this picture is saying is that it is essentially linear, at least in this spectral sense, as far as the Fisher is concerned.
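Because the Fisher is an average of rank-one terms, its spectrum can be sketched from per-sample gradients. A toy PyTorch version, using a tiny linear model and random data, where by the remark above the Fisher reduces exactly to the data correlation matrix:

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 1, bias=False)   # tiny stand-in for a network
X = torch.randn(100, 20)                     # random stand-in for the data

grads = []
for x in X:
    model.zero_grad()
    model(x).backward()                      # gradient of the output w.r.t. weights
    grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))

G = torch.stack(grads)                       # n x p matrix of per-sample gradients
F = G.T @ G / len(X)                         # empirical Fisher: average of rank-one terms
print(torch.linalg.eigvalsh(F).flip(0)[:5])  # leading eigenvalues; for this linear
                                             # model F equals X^T X / n exactly
```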
Okay, there are two more quantities that look like the data, and they govern optimization in some sense. If you take one of the logits, an entry of the output of the model y hat, let's say y hat 1, and take the Jacobian of y hat with respect to the weights, d y hat by d weights, and calculate the singular values of that particular Jacobian, then you get the red line, which also drops down and then decays with some slope. For this one we also did only the first 1,000, because this is another large matrix, and the values plotted are the singular values of that Jacobian. Then there are the activations: different layers have different activations, and this is the matrix of the correlations of the activations at the different layers. Again, it drops down and then decays slowly.

Now, we call such eigenspectra sloppy, for the following reason. If you pick one of the eigenvectors with a small eigenvalue, as I said, the loss doesn't change much along it. But let us focus on the Fisher for a second. If I take one of the small eigenvectors and perturb my weights in that direction, what this plot tells us is that I would have to perturb the weights by seven or eight orders of magnitude before I see any appreciable change in the output of the function; the orange curve is the curvature of the output, so that is the order of magnitude by which I have to perturb my weights. And that tells you that the weights corresponding to these eigenvectors are very, very underconstrained. The function essentially depends on these few directions, and the others are all directions I am free to move in: it doesn't really matter how much I change along them, I will still keep making the same kinds of predictions.

So in some very visual sense, it looks a little bit like you're taking this gigantic set of parameters and putting them down on the table, and if something falls off, if something doesn't stick, it doesn't really matter, because only these few weight combinations are being used, and all the three-million-minus-150 other combinations are really not constrained by the few samples that you have. There are 50,000 samples in this particular experiment, 50,000 images of 10 classes, and three million weights, so it is a humongously overparameterized problem.

If you think a little bit about what lies in the head of the data, the head for the blue line: this is a picture corresponding to Plato's theory of forms. The purest cats in your dataset will all lie here; the purest dogs in your dataset will all lie here. What lies in the tail of the dataset? Well, it could either be highly redundant structure, which doesn't change very much, let's say the blue sky that you'll see in essentially all images, very similar stuff; or it could be unusual structure that you don't see very often. This is a blue dog; there are not that many blue dogs in the world, so the blue dog will lie here.

Now, the way to understand this result is that when the network learns, it is learning these eigenvectors sequentially, and there is a very short proof that you can write down: in the first few epochs it fits this eigenvector, in the next few epochs it fits the next eigenvector, and it slowly digs through this entire tail of tiny, tiny parts of the signal that make up the predictions. And while 10 years ago we were happy to play only in this part of the eigenspectrum and get accurate predictions with, let's say, 95% accuracy, these days we seek a little more. We want the extra 5% of accuracy, and in order to get it, you have to dig through all of this tail until you reach the unusual blue dog.
And while there are not that many blue dogs in the world, now you care about them, so now you have to fight through the tail to get them. The reason you do not seem to suffer from overfitting on such problems, even though you are going all the way down into the tail, and this is clearly an unusual datum, so you could overfit on it, is that the tail is so much smaller than the head. Even if noise were to come and perturb some parts of the tail, the noise would have to perturb the tail by seven orders of magnitude before it would get to dominate the head. So for datasets that are sloppy, the models that we learn on these datasets are also sloppy, and the models do not overfit on such datasets because the spectrum has a very special structure: the smallest eigenvalue is much, much smaller than the largest eigenvalue. If noise were to come and change this part of the spectrum, then of course the model would make more mistakes, but then I wouldn't call it noise at that point; I would call it using the wrong microscope to measure things, for instance. It would be genuinely different data. Before we proceed, any questions?

Hello, Pratik. I have a question that may be naive. From what you said now, learning the larger eigenvalues first allows you to learn the main structure in the data, and allows you to get to this 95%. And the idea is that when you get new data, this data would be different, but different in the lower eigenvalues, the ones that have a smaller value, and you're saying that this perturbation will not affect your prediction; that's why you generalize well. Yes. But then I can understand this argument for why we can get this 95% accuracy when we try to generalize. But how do we get the extra 5%? I mean, I don't know if the question is clear. If my training data and my test data are very similar in the big eigenvalues, but they are dissimilar in the smaller eigenvalues, I would expect that I can only get an accuracy of 95%.

So I think there are two parts to your question. What you are talking about is a genuine difference between the distributions of the training and the test data, where the test data differs from the training data in the tail of the spectrum. You cannot get good answers for such kinds of data; this is distribution shift, and while you get 95% accuracy, that is essentially the best you can get, because the data is different. What I'm talking about is something slightly different: the training and the test data come from the exact same distribution, and the question is why I do not seem to overfit on the training data. Now, no matter which part of the tail I imagine my new test data coming from, the output of the model will not change much. We are fitting the dataset with the training and the test distributions being the same, and even though we have finitely many training samples, we don't seem to overfit, because even if we did overfit, we would not notice the difference between overfitting and fitting. For an eigenspectrum of this shape, you do not get hurt by interpolation, because there are not very many different things that can happen in the tail of the data. Okay. There is also a question in the chat, Pratik, if you can see it. Yeah, one second. Can you give me one second?
There is someone at the door; I don't understand why they're knocking. Yes, I'm back. "In the case of the Hessian, local high curvature may not imply that the global extremum of the loss function can be obtained in those directions. Cannot lower local curvature values lead to global extrema of the loss function?" Yes, yes. The curvature of a function has nothing to do with where you are; this is simply a plot of the curvature of the function at the end of training. Presumably you are at a good place at the end of training. So it is not a claim that we should be in a low curvature part or a high curvature part; I am simply showing you a plot of the typical kinds of curvature you find at the end of training. Okay. Cool.

So here is something that we can show. We cannot quite show that deep networks are sloppy directly, but here is what we can show. If the input spectrum is sloppy, that is, large in the beginning and then decaying very quickly at the end, then I would like to show that the Fisher information matrix and the Hessian are also sloppy: they also have a few large eigenvalues, and then eigenvalues that are distributed uniformly on a log scale. Now, I cannot prove the title of this slide precisely, and that is why I put a little tilde here; I can prove it in bits and parts.

Let us look at the first calculation. The trace of the Fisher matrix, which is the sum of the eigenvalues of the Fisher matrix, or the trace of the Hessian, is upper bounded by the trace of the data correlation matrix, which is the sum of the eigenvalues of the blue line, times some factor that depends on the weights of the network. Now, this is a pretty simple calculation, but the way to understand it is as follows. When people study neural networks, they often like to focus on this weight-dependent quantity, because it tells us how many different weights give you the same outputs, or what the shattering dimension of neural networks is, and so on and so forth. So they will analyze those quantities; I would like instead to analyze this quantity and play the following game. I can always put my weights inside an L2 ball and force this term to be uniformly upper bounded by some constant. So let us imagine that our weights live in some nice ball, and then this is just one big constant; it doesn't matter what it is. If I now take the network and stretch it out, making it wider and wider, I keep adding more and more parameters while still making sure that the weights lie within the same ball. The left-hand side then has more and more eigenvalues, while the right-hand side is a constant. So we at least know that, in this admittedly concocted construction, the eigenvalues of the Fisher matrix or the Hessian of these large neural networks have to decay. This doesn't quite say that sloppy inputs lead to a sloppy Fisher, and it doesn't say that the eigenvalues have to decay exponentially; they do decay exponentially in experiments, but the bound only tells us that we should expect them to decay.

You can also cook up simpler models of neural networks, for instance by linearizing the model around its initialization; then you get a linear model, you analyze the linear model, and in that case we can show the title precisely. You can also construct situations where the network is infinitely wide, in which case this linearization is actually a good linearization, and you can show the claim if the network is infinitely wide.
Again, we can show the title properly there. For kernel machines, and we talked about kernels in the first lecture, we can show the title again, but not for generic neural networks, mostly because the weights at different layers transform the input spectrum in many, many different ways, and there is not much you can say about the output spectrum without making very narrow assumptions on what the weights are, say using random matrix theory.

We can also show something about the blocks of the Fisher information matrix. The Fisher information matrix is a matrix of size number of weights times number of weights; consider the blocks of this matrix on the diagonal, where each block, roughly speaking, corresponds to one particular layer of the network. The eigenvalues of every such block are upper bounded by some factor that depends on the weights, again, times the eigenvalues of the correlations of the activations feeding that block. So if the activations of that particular layer are sloppy, and they do seem to be, the purple line also decays after a big start, then the eigenvalues of that block of the Fisher are also sloppy; this is an element-by-element domination, which is what I mean by this statement about the spectrum. Now, this is not quite a result for the entire Fisher matrix, but for large neural networks it has been noticed that the Fisher matrix is pretty close to its block diagonal approximation, because the different layers really are quite different in how they affect the output. So this is some theoretical understanding, and it tells you that the Fisher information matrix, or the Hessian, is also sloppy because the inputs to these models are sloppy. And we already alluded to how the network may not actually overfit because the Fisher is sloppy: the orange line is here, this is the eigenvector, and it doesn't matter what noise does over here; it will not be able to affect the errors I make in this part of the spectrum.

Can I ask you a question, Pratik? Go back to the theorem. How do I have to read this inequality? What is varying in these inequalities? Because C is a constant. I mean, obviously you're changing the size of the data, the size of the network, or something like that; otherwise, if I can choose a constant every time, I can always write an inequality.

No. So, the left-hand side is the sum of the eigenvalues of the Hessian, let's say, and the right-hand side depends on the weights of the network. It is very easy to control the magnitude of the weights, just by L2 regularization, so I can always pretend that this factor is upper bounded by some constant. The easiest way to understand this inequality is that it tells us what the trace of the Hessian is for a network whose weights lie in some L2 ball of my choice. And for very large networks, as the size of the network keeps increasing, the extra eigenvalues that are being added on the left-hand side have to be smaller and smaller, because the right-hand side is a constant. Okay. And that constant C is calculable? Yeah, you can write down an expression for C. It depends on the derivative of the activation function and such; it's a function of the number of layers and so on. I see. Okay. Basically, this is just a Lipschitz-constant-style argument; there is nothing very deep happening there. And the multiplicative factors come simply because you are going up the layers. Okay, thank you. Okay.
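Schematically, then, and with the caveat from the discussion above that C(w) bundles the weight-dependent, layer-by-layer multiplicative factors (bounded once the weights are confined to an L2 ball), the first calculation reads:

```latex
\operatorname{tr}(F) \;=\; \sum_{k} \lambda_k(F) \;\le\; C(w)\, \operatorname{tr}\Big(\frac{1}{n}\sum_{i=1}^{n} x_i x_i^{\top}\Big)
```

Widening the network adds eigenvalues on the left while the right-hand side stays fixed, so the extra eigenvalues must shrink.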
So, everything that I said doesn't seem to be very specific to neural networks or to these particular vision datasets. Many models in nature are also sloppy, and I'll give you an example; I think most of you know this better than I do. This is a picture that systems biologists like to write down when they think of a cell signaling network. This is a network of proteins that exists within the cell, and these proteins interact in chemical reactions to regulate certain other things; this is the network for, let's say, one part of EGFR regulation. Mathematically, you can think of this system as a system of coupled differential equations, where each reaction corresponds to one differential equation. The output of this model is, let's say, the concentration of EGFR; the parameters of the model are the rate constants of the different reactions, or the concentrations of the different proteins that play a role in these reactions. So I can think of this as simply a parametric model: I can record data from such a system, I can fit a nonlinear model to it, and I can calculate the Fisher information matrix for this system. Again, you see that the eigenvalues are distributed uniformly over a very large range, up to about six orders of magnitude.

And there is a very nice reason why biology likes such systems. It tells you that there are two combinations of parameters that play a dominant role in the regulation of the output, even if the actual system has many degrees of freedom. This is useful because if something happens to one part of this circuit, you don't want the output to go very wrong. If the output does not depend strongly on any individual parameter, if it depends strongly on only two combinations of the parameters, not the parameters themselves, then there are many parameters you can change; in fact, the eigenvectors are precisely what give you those combinations. You can change them without actually affecting what the system does, what the behavior of the system is. You can read a very beautiful paper that is a long review of about 20 years of work from Jim Sethna's group, who discovered sloppy models when they were fitting data from systems biology.

Here is at least a model of some natural data: most of you know the Ising model. You can think of the Ising model as creating an output distribution on the spins, where the parameters of the model are the coupling constants that define how much nearby spins prefer to take each other's values. This also has eigenvalues that are spread over a very large range, and that simply tells you that the exact values of all the coupling constants do not determine what the output distribution of the spins is. We know that if the coupling constants, for instance, are positive and the temperature is quite small, then you will be in some sort of magnetic phase, with, say, positive magnetization. There is a lot of redundancy in how the output of such a system is created by its degrees of freedom. That is what sloppiness seems to point to: systems with a large number of degrees of freedom whose behavior doesn't really need all those degrees of freedom. The circuit in biology presumably evolved precisely to have this behavior: to not depend too much on any one thing, and to keep working even when things are missing.
The Ising model is just an artificial model of how we think of magnets and some other phenomena, and it is a faithful model in many cases; it also bears this pattern. You can play this game for many other systems. This is a power grid, where you can think of the output voltage at some part of the city as a function of how many generators are creating power in different parts of the grid; that output is a weak function of all these different variables, and so on. In some sense, sloppiness seems to be very pervasive in many models that people have studied in the past, and this list goes on. I have not found a dataset which is not sloppy; if you find one, I promise to buy you a beer. What I would like to argue, and what the physicists missed, for instance, is this: they have been studying how models are sloppy, but models are, after all, models of our data. What I would like to study is really how and why the data is sloppy. These kinds of calculations show that if the data is sloppy, the model is sloppy; and if the model is sloppy, the extra parameters do not hurt. But the very deep question about intelligence is why data has such low-dimensional structure that allows us to learn well. Here is one application. It is very related to the bias-variance trade-off, and it is how people will begin explaining what are called PAC-Bayes bounds. PAC stands for probably approximately correct. It is a particular way of formalizing learning, and people in learning theory will say things of the following kind. You give me some data, I give you a model, and this model has so-and-so accuracy. I do not demand the model to be perfectly accurate with respect to nature's labels; that is why I am happy with an approximately correct model. And depending on the data that you give me, with probability delta I am allowed to give you a bad model; this is the probably part. With high probability, at least one minus delta over the datasets that you give me, I will return to you a model that performs approximately well. That is PAC learning. There is a professor called Leslie Valiant who formalized this, and there is a very nice and long theory for thinking about what it means to learn. They write down expressions of the following kind. Let Q be a probability distribution on the weight space. Every particular weight configuration defines a model for us, a neural network that we can use to make inferences on test data points, and Q is a probability distribution over all such models. Little e of Q is the probability that you make a mistake on a test datum drawn from nature's distribution if you sample one model from this distribution Q; so it is the average probability of making an error, and, being a probability, it is smaller than one. e hat of Q is the probability of making a mistake on the training samples: if at the end of training you take a bunch of networks and think of them as your distribution Q, then e hat of Q is the average training error of the different networks in your bag. This number will typically be quite small, smaller than e, because you trained on those points. A small sketch of these two quantities follows below.
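Here is that sketch of what e(Q) and e hat of Q mean operationally; it is my construction, with a hypothetical linear classifier standing in for the network and a Gaussian Q centered at some trained weights.

```python
import numpy as np

rng = np.random.default_rng(1)

def avg_error(w_mean, w_std, X, y, n_models=200):
    """Monte Carlo estimate of the average 0-1 error of classifiers whose
    weights are sampled from Q = N(w_mean, w_std^2 * I)."""
    errs = []
    for _ in range(n_models):
        w = w_mean + w_std * rng.standard_normal(w_mean.shape)  # draw from Q
        pred = (X @ w > 0).astype(int)            # toy linear classifier
        errs.append(np.mean(pred != y))
    return float(np.mean(errs))
```

Called on the training set, this estimates e hat of Q; called on fresh samples from nature's distribution, it estimates e of Q.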
And so people write down inequalities of the following kind. The test error, the probability of making a mistake on test data, is less than the training error plus a term whose numerator is the Kullback-Leibler divergence between the distribution Q and some prior P. Ignore the prior P; let's say it is uniform over some compact space. In that case you simply get the entropy of the distribution Q, which measures the number of different models that Q puts its probability mass on. If the number of models it puts its mass on is large, then this expression tells you that you need a proportionally large number of samples n to find one such model. And this makes total sense: if you have a large model that you are trying to fit, then I also need to give you more data, and this is precisely how much more data I should give you to keep your estimate of the error, which is the training error, close to the true error, which is the test error. Now, the issue with these kinds of inequalities, and they are very old inequalities, 30 years old perhaps in this very form, is that when you plug in numbers for neural networks, you get absurd answers. The test error of a typical network that you might train on your laptop is 5%. The training error of this network will be roughly zero, so that term will just be zero. But because there is such a large number of weights in the network, the right hand side will typically be much larger than one, let's say 100. The inequality then reads 0.05 less than or equal to 100, the left hand side being the probability of making a mistake; and you do not need a theorem to tell you that a probability is less than 100 (see the numerical sketch after this paragraph). So these kinds of inequalities become vacuous, and this has been the source of a lot of anxiety in learning theory: people say we need new theory to understand deep networks because they obviously do not conform to what we have. In some sense it is a huge source of embarrassment for us theorists, because we think this is a pretty well-defined way of thinking about learning, yet somehow it does not even match the simplest experiment that we can perform today. What we did in this paper is a pretty simple calculation. We did a Taylor expansion of this term and optimized over Gaussian probability distributions Q on the weight space and Gaussian priors P, so in that sense you can do it in closed form; the mean of the Gaussian is the trained network, and the variance of the Gaussian is what you optimize to find Q (a sketch of the closed-form KL term is also given below). We can give an analytical, non-vacuous bound for small neural networks. For a three-layer network on the MNIST dataset, the digits 0 to 9, the right hand side is 0.38 and the left hand side is something like 0.02 or 0.03. So it is not as if the inequality is tight under this calculation, but the right hand side is actually smaller than 1. There are two ways of understanding this. One is that instead of searching for new kinds of theory to understand deep learning, you can simply use the existing theory, but appreciate the fact that what controls the complexity of these models is not the number of parameters but really the number of distinct degrees of freedom; for sloppy datasets, the number of degrees of freedom in the learned model is very small, and there are many, many redundant ones.
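To see the vacuous-versus-non-vacuous distinction numerically, here is one common form of the bound, a McAllester-style variant; the talk does not pin down the exact constants, so treat them as illustrative.

```python
import math

def pac_bayes_rhs(train_err, kl, n, delta=0.05):
    """Right hand side of a McAllester-style PAC-Bayes bound:
    e(Q) <= e_hat(Q) + sqrt((KL(Q||P) + ln(2*sqrt(n)/delta)) / (2*n))."""
    return train_err + math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * n))

# KL comparable to the raw number of weights: the bound says nothing.
print(pac_bayes_rhs(train_err=0.0, kl=1e6, n=60_000))    # ~2.9, vacuous
# KL measured by the few salient degrees of freedom: non-vacuous.
print(pac_bayes_rhs(train_err=0.0, kl=1_000, n=60_000))  # ~0.09
```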
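And here is the closed-form piece mentioned above: with a posterior Q = N(w, diag(sigma^2)) centered at the trained weights and a prior P = N(0, eps * I), the KL term has the standard Gaussian form below; this is the textbook expression, not necessarily the exact parametrization of the paper.

```python
import numpy as np

def gaussian_kl(w, sigma2, eps):
    """KL( N(w, diag(sigma2)) || N(0, eps * I) ), standard closed form."""
    w, sigma2 = np.asarray(w), np.asarray(sigma2)
    return 0.5 * np.sum((sigma2 + w ** 2) / eps - 1.0 + np.log(eps / sigma2))
```

Balancing this KL against a quadratic expansion of the training loss gives, roughly, sigma_i^2 = 1 / (1/eps + n * lambda_i) along the Hessian eigendirections (exact factors depend on the bound variant), so directions with tiny eigenvalues get sigma_i^2 close to eps and contribute almost nothing to the KL; that is why only the eigenvalues above the elbow end up counting.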
So the calculation that we did exactly removes those redundant degrees of freedom, because they do not affect the output, and measures the model only in terms of its salient degrees of freedom. If you appreciate the fact that the model was trained on the data, and that the data has some structure, so the model also has some structure, then existing theory works just fine to understand deep learning. We have also developed a number of sophisticated numerical ways of optimizing the right hand side of this inequality, and there you can get essentially tight inequalities as well. So PAC-Bayes bounds, which is simply the name given to these kinds of constructions: we can get non-vacuous, analytical PAC-Bayes bounds for neural networks using the concept of sloppiness. Here is one thing to remember when you leave: how big are deep networks? Well, you give me a lot of parameters, but you do not give me the requisite amount of data to fit all of them, so there are lots of parameters left floating that do not have to take any well-defined values. Roughly speaking, we can mark out the elbow of this eigenspectrum as something special; this is what shows up in the calculations. The number of eigenvalues to the left of the elbow is a quantity that we define to be the effective dimensionality of a network, and it has a very nice form. The constant epsilon is simply the scale of the prior; it is not very important. So the effective dimensionality of the network is the number of eigenvalues of the Hessian that are larger than epsilon divided by...
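The transcript cuts off mid-formula, so the exact threshold is left as a parameter in this sketch; the counting operation itself is just the following.

```python
import numpy as np

def effective_dim(hessian_eigs, threshold):
    """Effective dimensionality: number of Hessian eigenvalues above the
    elbow.  The exact threshold (involving the prior scale epsilon and,
    per the cut-off sentence above, some further normalization) is not
    fixed here."""
    return int(np.sum(np.asarray(hessian_eigs) > threshold))

toy_spectrum = 10.0 ** np.linspace(2, -6, 1000)    # sloppy: eight decades
print(effective_dim(toy_spectrum, threshold=1.0))  # only stiff directions count
```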