Thank you. Can you hear me? All right, so good evening. It's really a pleasure to be here. I don't often get to talk to people from so many different countries, which you hardly see in Israel, so this is really a nice opportunity. I am not going to give you a crash course and compress a year into two hours. The story I want to tell you is about some relatively new theory, which is not even completely written up or published yet, and it has to do with a better understanding of deep neural networks. I know you have all heard a little bit about deep learning, but not that many of you actually do deep learning; very few, just to calibrate my level of expectation. So first of all, I'm going to talk slowly at first, but I really want you to follow me. Whenever I say something you don't understand, or a concept you don't recognize, raise your hands. I really want you to be with me, even if I get through a lot less material. So do ask questions during the talks; that's important. As I'm sure you all know, AI, artificial intelligence, is really a process with three very clear phases. The first one started with Turing, with the idea that our brain is nothing more than a Turing machine, and that intelligence is essentially some very clever algorithm that runs in our brain. Not all of you may buy this, but this was the starting point of what we now call AI. The first phase of artificial intelligence, between the 50s and the 80s, 30 or 40 years of what we may now call naive AI, was really based on logic. The idea that people had at that time, mainly computer scientists, was that in order to program an intelligent machine, all you need to do is ask an expert how they do things and then program the rules. The idea was as naive as it sounds today: I'll ask you how you see, you'll tell me, and I'll program it.
I'll ask you how you walk, or how you speak, or how you understand language, or how you do whatever you do, and of course, since you do it, you must know how to do it. All you need is a list of commands that tells us what to do, and we'll program it. This was the naive AI, as I call it today. It was all based on logic: everything is either true or false, there are no intermediate values in anything. These were mathematicians who came from the old computer science, where probabilities were considered out of the game; we do logic. Now, as I'm sure you know, expert systems were essentially big lookup tables: in this condition do this, in that condition do that, and so on. Just a big list of commands, even for doctors who diagnose diseases, who are supposed to be experts. And this approach is wrong. No one can tell you exactly what they do, and certainly not list it as a long sequence of commands. This naive AI, expert systems or lookup tables, obviously failed in retrospect; it didn't bring us very far. You know the state of the art: in the 60s and even the early 70s there were people, really the leaders of artificial intelligence, like Marvin Minsky at MIT and others, who thought that computer vision is a very easy problem; we'll give it as a summer project to two graduate students and surely solve the problem of how we see, how we recognize objects, and so on. This was really the state of the art until the 80s. Now in the early 80s, roughly between 1982 and 1984, there was a big shift from what we called artificial intelligence to what we can call today machine learning. And machine learning is all about statistics, as you all know. So essentially, forget these logic machines that know everything and allow only true and false values for variables; think about probabilities.
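To make the "big lookup table" picture concrete, here is a tiny toy sketch, my own illustration rather than anything from the talk (the symptoms and rules are invented), of what such a condition-action expert system amounts to:

```python
# A naive "expert system" is just a list of (condition, action) rules,
# scanned top to bottom: "in this condition do this, in that condition do that."
RULES = [
    (lambda s: s["fever"] and s["cough"], "suspect flu"),
    (lambda s: s["fever"] and not s["cough"], "suspect infection"),
    (lambda s: not s["fever"], "no diagnosis"),
]

def diagnose(symptoms):
    # Fire the first rule whose condition matches; pure logic, no probabilities.
    for condition, action in RULES:
        if condition(symptoms):
            return action
    return "unknown"

print(diagnose({"fever": True, "cough": True}))    # suspect flu
print(diagnose({"fever": False, "cough": False}))  # no diagnosis
```

The brittleness is visible immediately: any case the expert did not anticipate falls through to "unknown", which is exactly why this approach failed to scale.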
So machine learning took artificial intelligence from this logic phase to the statistics phase, which was essentially about inferring rules from uncertain data. Instead of programming the rules by asking an expert, we learn the rules by accumulating data, and what we program is not the rule itself but what we now call the learning rule, or the meta-rule. The machine is now designing, or generating, or learning, or estimating, whatever you want to call it, inferring some sort of a model, and this model usually has parameters (or, if it is very large, we sometimes call it non-parametric). In principle, instead of learning the rules we learn a class of possible rules and then just adjust parameters. This was machine learning: it moved from logic to statistics. And in the early 80s, '82, '83, '84, I was still a PhD student then, there were two seminal papers that influenced the field tremendously. One of them, in 1984 I believe, was by Leslie Valiant, a very prominent computer scientist, who wrote a very elegant, clean paper called "A Theory of the Learnable". It essentially shifted attention from lookup tables, from these long lists of rules, to what we call generalization today: how do we do outside of the training data, and can we actually find rigorous, as far as possible distribution-independent, bounds on generalization, on the error outside of the training set? This is something I'm going to talk a lot about today. This was one very important paper, because I think it essentially brought machine learning, brought statistics, which engineering departments had done for a long time, parameter estimation and things like this, into computer science. The other interesting paper, which many of us forget, but which for me at least was very important, was the Hopfield model, from 1982.
John Hopfield introduced a neural network model, which I'm going to mention at some point, of memory, of associative memory, and it had a very interesting flavor. It was some sort of an energy landscape with attractors. After learning from random patterns, through something we now call the Hebb rule, synaptic connections get stronger or weaker depending on the data, and this generates an energy function. Then, in the recognition or memory phase, the dynamics flow into local attractors of this energy function. This was very appealing, especially to physicists, because it looks very much like a spin-glass Hamiltonian with multiple minima, and both the dynamics and the study of the space of these connections introduced a flurry of activity in the 80s and early 90s on the statistical physics of neural networks, and eventually on the statistical physics of learning. Later on, in the late 80s, I was at Bell Labs, and Bell Labs between '85 and '93 or '94 was really the best place in the world to think about machine learning. Things were happening there; the corridor meetings on a daily basis were really fascinating, because we had a lot of people interested in this, from neurobiologists to engineers, mathematicians, and physicists, all of whom thought about the physics and mathematics of neural networks and learning in general. And I came from physics, from statistical physics and dynamical systems in my PhD.
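The Hopfield scheme just described, Hebbian outer-product learning generating an energy function, with recall dynamics flowing into local attractors, can be sketched in a few lines of NumPy. This is a minimal illustration of my own, with arbitrary sizes and a fixed seed, not code from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                                      # binary (+1/-1) neurons
patterns = rng.choice([-1, 1], size=(3, N))  # random patterns to memorize

# Hebb rule: connections strengthen when units fire together across patterns.
W = sum(np.outer(p, p) for p in patterns) / N
np.fill_diagonal(W, 0)                       # no self-connections

def energy(s):
    # The energy function whose local minima are the stored attractors.
    return -0.5 * s @ W @ s

def recall(s, sweeps=10):
    # Asynchronous sign updates; each flip can only lower the energy,
    # so the state slides downhill into a nearby attractor.
    s = s.copy()
    for _ in range(sweeps):
        for i in rng.permutation(N):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

# Corrupt a stored pattern by flipping 15% of its bits, then recall it.
probe = patterns[0].copy()
flip = rng.choice(N, size=15, replace=False)
probe[flip] *= -1
retrieved = recall(probe)
print(np.mean(retrieved == patterns[0]))  # overlap with the stored memory
```

With only three stored patterns, well below the network's capacity, the corrupted probe is pulled back to the stored memory almost perfectly, which is exactly the associative-memory behavior described above.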
This was a very attractive field, and we started to think about large-scale machines. I'll talk a lot about what I mean by large scale and why it is so important. This was the late 80s; between '86 and '92 was really the golden age of this field of the statistical physics of learning and statistical neural networks. Then it eventually decayed, and, surprisingly or not, it became relevant again in the third phase of AI, which runs from the late 2000s until now; we are still very much in the hype of this phase, which we call deep learning. So neural networks came back, and they came back in a very interesting way. In some sense they came back to the very original idea that Rosenblatt, in the late 50s and early 60s, had already introduced: the notion of multi-layer systems, very much in the spirit of what we are now using as deep learning. He called them perceptrons, and he thought about multi-layer perceptrons. I'm sure you have all heard about one-layer perceptrons, but Rosenblatt knew already in 1959 that a one-layer perceptron cannot do much: it can only handle linearly separable data, data you can separate with a hyperplane. So he already thought about many layers, and if you read his book today, you find it really very revealing how much he already understood. But, the irony of history, in the late 60s two of his colleagues, Marvin Minsky and Seymour Papert, both at MIT at the time, wrote this wonderful, rigorous book called "Perceptrons", which argued very convincingly that those multi-layer systems cannot be trained. It's not that they didn't think about something like stochastic gradient descent and back-propagation, as we call it today; they ruled it out immediately in the book as something that is going to get stuck in local minima very quickly,
and therefore has no chance of actually working. Surprisingly, this dominant argument, completely rigorous and yet entirely wrong in its conclusions, essentially killed the field until the 80s, when mainly psychologists like Rumelhart and Hinton, with Terry Sejnowski and a few others in the PDP group, revived it. That is exactly what restarted the whole neural network enterprise, what we now call connectionism, in the 80s. Then, surprisingly, the whole thing went down again in the 90s, essentially due to the work of one person, Vladimir Vapnik, who came from Russia at that time, re-introducing some old statistical ideas that he had actually developed in the 70s. He took neural networks into a very interesting phase which we now call kernel methods, or support vector machines, which are essentially two-layer neural networks, but very rigorous, with performance bounds, and this whole notion of margin and generalization bounds is really very elegant. During the 90s, and until the mid 2000s, the whole field of neural networks was out of the game. People at NIPS could not write papers with "neural network" in the abstract, because that would mean an immediate reject, or something like this; it was really the case, I was there. Now, surprisingly, in the late 2000s some of those stubborn guys, like Yann LeCun and Geoff Hinton and their students, eventually brought it back, and they brought it back in the form of very deep networks, with many, many layers. It wasn't clear why this should make any difference. We already had theorems that told us that a few layers, or even one hidden layer, are enough in principle, although we knew that the computational complexity of training even one hidden layer is very hard. But it wasn't really clear why adding many layers, and now we are talking about really many layers,
thousands or even ten thousand layers in modern machines, should make a difference. But it made a difference, and eventually, in the late 2000s, these deep neural networks, which are nothing more than the original Frank Rosenblatt perceptron but trained with back-propagation, with stochastic gradient descent, started to win every competition in pattern recognition, starting with image recognition, then speech recognition, and then essentially every other problem. Of course there are all sorts of variations, where maybe the most sophisticated ones are what we call deep RL, deep reinforcement learning, deep systems with feedback, and so on. This element of deep neural networks is now at the core; there is no real competition. There may be different networks or different pieces of the network, there can be an actor network in RL, there can be many other things, but this is the universal learning device today, with all sorts of variations which I'm going to mention later on, like residual networks or skip connections. There are variations on this theme, some of them very interesting, but conceptually this is the game. Okay, so this sounds very good, because in a real sense it brought AI back to life. AI in this deep learning phase is doing better than ever, and suddenly things which I thought would never happen, like continuous, real-time, speaker-independent speech recognition, as we all do with Siri and Google speech (I worked on speech recognition for some time in the 80s, and I thought this was going to be a very hard problem that I would not see solved in my lifetime), or very satisfying object recognition and face recognition, are here. These deep networks are now everywhere, whether you like it or not, and they are taking over things like control: driving autonomous cars, controlling large-scale systems, communication, and whatever. So it sounds very nice.
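As an aside, the linear-separability limitation of the one-layer perceptron mentioned earlier is easy to see concretely. This is a toy sketch of my own, not code from the talk: Rosenblatt's learning rule solves AND, which is linearly separable, but no setting of the weights can ever solve XOR, since no single hyperplane separates its labels.

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=0.1):
    # Rosenblatt's rule: nudge the weights whenever a point is misclassified.
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            pred = 1 if xi @ w + b > 0 else 0
            w += lr * (yi - pred) * xi
            b += lr * (yi - pred)
    return lambda x: (x @ w + b > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])  # linearly separable: one layer succeeds
y_xor = np.array([0, 1, 1, 0])  # not separable: no hyperplane works

print((train_perceptron(X, y_and)(X) == y_and).all())  # True
print((train_perceptron(X, y_xor)(X) == y_xor).all())  # False
```

A single hidden layer fixes XOR, which is exactly the multi-layer idea Rosenblatt was after and the training of which Minsky and Papert declared hopeless.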
This was actually a very big surprise to a lot of people. You heard about this yesterday, and I'm sure you know about it: the idea is really very simple. You put the image, the pattern, at the input, you look at the output, and then you back-propagate the error layer by layer using the chain rule of derivatives, and you slowly adjust the weights of the connections between the layers by small changes. I don't have to explain what the neurons are; they are linear threshold functions with weights, and I'll come back to this in a second. This really very naive algorithm is just gradient descent, and you don't even calculate the gradient exactly, you calculate a noisy version of the gradient, for a reason I'm going to get into very carefully later on. So it is something like the stochastic optimization problem that Bert just talked about, but not quite, because the noise in the gradients is not uniform, it is state dependent, and so on. There are many, many interesting things there, and somehow it works beautifully. We are getting to human performance, or better, on object recognition, we are getting to human performance on speech recognition, we are getting to human performance on some control problems, and so on. So this calls for a theory. Okay, I just want to mention again that the neuron itself is this very simple linear threshold function. I take the inputs, the features coming out of the previous layer, and multiply them by some vector of weights. There is always what we call the bias, or the offset term, which is very important because it tells me where the zero is, and it actually has a different scale from the other weights. This dot product then goes through some sort of nonlinearity, which in the original cases was a sigmoidal nonlinearity, this saturating, smooth, sign-like function. Today we actually use things like ReLUs, what we call piecewise linear functions, which give us very simple derivatives: think about a ReLU, its derivative takes only two values, zero and one. It's a very crude derivative function, and it still works very well, actually better than many other things. So, the questions I'm going to discuss in this series of talks: I never had so much time, so I really want to try to do it slowly. In the first lecture today I'm going to discuss statistical learning theory, what I call the old statistical learning theory, which is essentially Valiant's PAC models, probably approximately correct bounds. How many of you have heard about them? Very few, so I'm going to go through them slowly; physicists usually never hear about PAC bounds, though some computer scientists do. Then I'm going to modify these bounds and argue why they are not suitable for this type of learning, and this is what I call rethinking learning theory: how much large-scale learning should really change our classical view of things. So, starting from the classical PAC bound, I'm going to introduce something we already knew in the 80s, which are distribution-dependent bounds. I'm going to argue that we need to go beyond PAC bounds, that PAC bounds are essentially useless, and they get more and more useless the larger the problem becomes, because the worst case cannot be controlled, it is too far away. We need to control some sort of typical case, so I'm going to talk a lot about this typical behavior: what do we mean by typicality, what do I mean by a typical learning problem, and why is this enough. And again, I'm going to talk about it in some sense in the thermodynamic limit; this is something I still believe in, for exactly the same reasons that
statistical physics is so useful in modeling large-scale systems. But, you know, large scale there means the Avogadro number, the number of molecules in your glass of water; that is really large. I'm going to argue that even at much smaller numbers, order of ten, or order of a hundred, this large-N limit already becomes very important and dominates the story completely, even for very small networks. For this I need a lot of concepts that some of you may know, like entropy in the Shannon sense, mutual information, KL divergences. So let me calibrate again: how many of you know what mutual information is? Okay, all of you. How many know what a rate-distortion function is? Not many, okay. And how many know why entropy is so important, this typicality, what we call the asymptotic equipartition property? Okay, so you are somewhere in between: you have heard about mutual information, but you don't really know how it is used in information theory. That's the state of the art; that's good. So today I'm going to spend some time on motivation. This will not be a crash course in information theory; that would take too long, or it would be too compressed, and as you know there is a trade-off between compression and understanding. I'm going to go slowly, talk about the entropy, talk about mutual information, talk about the KL divergences and why they are so important. Then I'm going to go to my main new result here, which I call the information plane theorem, which essentially argues that when you start to talk about very large systems, surprisingly, this very complex system with tons of parameters, millions of weights, actually reduces to two numbers which are really important. This is very similar in spirit to equilibrium thermodynamics, where you know that eventually
it's a trade-off between two functions, the energy and the entropy of the system, and if you know this trade-off you can calculate whatever you want. Something very similar, but still different, happens here in my opinion, and these are the two values of information which are going to be important for the rest of the discussion. I'm going to talk about the behavior of neural networks in this plane and motivate the questions of the next part, tomorrow morning. After we understand what's going on in this information-theoretic view of neural networks, I'm going to bring dynamics back. I argue that the information functions alone cannot give you the really interesting story, because they are invariant to permutations and to invertible reparametrizations of the variables; they are completely oblivious to them. Therefore, in the information plane alone I don't see the computational complexity at all; it's simply not there. In order to see what happens in time, I need to go back exactly to the topics that Bert and maybe others discussed here, the stochastic training dynamics, in particular what we call stochastic gradient descent, or SGD. I'm going to argue, in the spirit of the Fokker-Planck equation if you like, that there are usually two terms in this dynamics, a drift term and a diffusion term, but they are completely imbalanced. In the first part of the training the drift completely dominates the story, and in the second part of the training, which starts quite quickly, after a few epochs, the diffusion dominates the story. So most of the training of deep neural networks, according to this view, happens due to the diffusion, the random walk, the Wiener processes that the weights undergo when you train them with stochastic dynamics. And I'm going to show you some very nice evidence, to prove, or at least numerically convince you, that there are indeed two
phases to the dynamics: one of them is mostly drift, and the other is mostly diffusion. We're going to look at how the weights scale with time, and as you will immediately recognize, the drift phase gives linear growth of the weights, while diffusion gives square-root-of-t growth of the weights, because it's a random walk. This is very clearly observed in all the networks we looked at, and by the way, anyone can see this; it has nothing to do with any information-processing assumptions. Then I'm going to prove a bound, which I call the Gaussian bound on the mutual information, and for this I'm going to use again concepts from information theory; I need the concept of a Gaussian channel, which I'll introduce and discuss slowly. This is going to give us a very interesting bound on the mutual information between two successive layers of the network. Then I'm going to give you really one of the highlights of the theory, which is the theorem about the computational benefit of the layers. If it is true that everything is dominated by diffusion, then suddenly there is a good explanation, which I like, of why many layers help: I'm going to prove that in this case the time it takes to converge to a good solution actually goes down with the number of layers, which is very surprising and very non-intuitive. That is going to be the end of my second talk. In the third talk I hope to elevate the whole thing to a much deeper level of physics, if you want to think of it that way. If this is the story, what are these layers really, what do they represent? Do they differ from one another? This is what I call opening the black box: what do these layers actually encode? I'm going to use a little bit of group theory and some understanding of phase transitions in order to give you some real insight. This is completely new and unpublished; you get it for free. Okay, no problem. So, when we talk about machine learning, in the second phase, what I call the statistical phase, what you should all have in mind is this type of picture: curve fitting. When you have experimental data, say these points here (anyone who has done an experiment in a physics lab knows this), you get data in a plane, y versus x, and you are asked to approximate it with a function. So if this is my data, the first try is to approximate it with a constant; it doesn't look too good. The second try is to fit, let's say, a linear function; this is the first thing any physicist will try, and it doesn't look too convincing either. So what do we do? We go to higher-order polynomials: quadratic, third order, maybe very high order. You know that in the limit, if I take the degree of the polynomial to be the number of points minus one, I can fit all the data exactly. So that's great: I have a very nice model that fits my data perfectly. What's wrong with it? It will not generalize. But how do we know? This is really the main lesson of statistics, of the early phase of machine learning: you shouldn't overfit. You shouldn't try to pass through all the data points, because that requires too complex a model to make sense. It will be a perfect interpolation; anyone who has learned numerical analysis knows those Chebyshev-style polynomials that can fit the data perfectly, with perfect alternation and so on, but they go crazy everywhere else. They are over-constrained inside my data and completely wild outside; there is no control over what happens to these high-degree polynomials. This is what we call overfitting, and because we actually care about generalization, the error on new data, a new point here will go wild in general with this high-degree polynomial, because these polynomials oscillate so widely between the points that the chance of actually hitting a new point is small. Now, if you look at this data, you would actually say that a reasonable model is something like a sinusoid: there are two clear oscillations here, and the rest is probably noise. That brings up another very important point. First of all, we know rigorously, from all sorts of theorems, that if you really want to generalize well, the number of data points should scale like the number of parameters, say the number of coefficients or the degree of the polynomial. That's a rule of thumb; it should be a little more than that, because in order to estimate the parameters with some confidence you need a few more points. So that's the first rule of learning: the number of data points should scale with the number of parameters. Unfortunately, deep learning seems to violate this rule dramatically. We have millions of parameters and only hundreds of thousands of examples; this is really generally the case. So something is really wrong with our understanding of fitting: curve fitting is not what deep learning is doing, at least not in the naive sense. Now, you saw already that if, instead of polynomials, I use a sine function here, I can fit this data very well with very few parameters: the amplitude, the phase, the offset, and of course the frequency. So three or four parameters can fit it just as well as the very-high-degree polynomial. The choice of the right class of functions in which we search for the approximation is therefore very important, and that is something we usually call, in learning, the hypothesis class. Of course, if this class is too wild, for example if I allow my sinusoid to be
with unbounded frequency, then again I can fit this data very nicely with one sinusoid, with only one parameter, because I can oscillate between the points at very, very high frequency. This obviously says that it's not just the number of parameters; there's something else there. Okay, so before I go on: of course, deep learning also goes together with this notion of big data. Unlike classical learning algorithms, which usually have finite dimensionality in a sense I'm going to describe in a second, and which eventually saturate (when I add more data, the performance stops improving), deep learning algorithms seem to behave differently: you add more data, you get better models. This is of course why adding millions, or hundreds of millions, of images seems to keep improving the recognition at Facebook or wherever. There is a reason for this: they don't really have a finite-dimensional model, and they manage somehow to adapt their complexity to the data without overfitting, or at least without overfitting as much as we would expect from simple models. These are the first questions we're going to ask and answer today. Before I go to the chalk talk, before starting to prove things and bounds, I want to give you the outline, the high-level picture, first, and then I'm going to come back to this picture again and again. These neural networks have the following structure, and I'm going to open this box a little bit. There is an input, or inputs, which I'm going to call x_i unless otherwise stated, and these x_i belong to some variable which I'm going to denote by capital X. Think about these x_i, for the rest of my
talks, as images. They have, I don't know, 10 megapixels; images of objects, let's say faces, and my task is to recognize something very simple: is it me or not me in the image, is it a picture of me or not? Essentially, the output of this black box is something which I call the label, Y, and in general I'm going to denote by capital Y the variable from which this label is sampled. Now, we call it a black box because something very strange happens inside, and I'm actually going to open this black box. What happens is that the input is mapped into something. There is this Y, which I'm going to put here for a reason, which is the desired label: for every x there is a y which tells me this is my picture or not my picture; you can label pictures like this. So there is some underlying distribution P(X, Y) from which we actually sample our data: I get pairs (x_i, y_i) which are distributed according to this P(X, Y). This is the way we usually learn: we sample this input distribution and get a label for each sample. Then something strange happens in between, and eventually I get, let's call it now, Y hat, which is the actual output of the system. The reason it's not Y is that I'm not sure this is what the system is going to generate; my goal is to get Y hat and Y as close as possible to each other in some sense. But that never happens exactly; it is going to happen in some statistical sense, and that's what we want to guarantee. And somewhere in between, something happens; let's call it T. I'm actually going to use several notations for it: I'm going to call it X hat, or sometimes I'm just going to call it the hidden layers, the set of hidden layers. So just as you see X, the data, which is sampled from this joint distribution of X and Y, I get a finite
sample. It can be very large, but it is just a finite sample, usually much smaller than the set of all possible images, which is essentially infinite. I'm going to get something like n samples here, or let's call it m samples. These images then go through a transformation, layer by layer, which I'm going to call the representations of the data. Each of those h_i is some function of X. What happens here is some sort of internal representation (not all people like this term), which can be one layer or many layers; inside this box the images are scrambled and transformed in a very interesting way, which we don't really understand, but eventually they generate this Y hat. This is the picture I want you to have in mind, and I'm going to phrase it in information-theoretic terms. In general T is a function of X, but even if it's not really a deterministic function, maybe just a stochastic function, this is some sort of encoding. So I'm going to use the term "encoder" for this conditional distribution: how the internal representation T depends on X. For a good reason, which I'll explain later on, I'm going to think about it as a stochastic map; if it's deterministic, that is just a special case, a limit, of a stochastic map. It can be stochastic either because there is some noise in the map, or because I choose to look at it as some sort of many-to-one map, for example by interpreting the sigmoidal functions in the units as probabilities. This is actually what we usually do: we think about the units as probabilities of something, the probability of being one or zero, or minus one. If I think about these arctangents, or hyperbolic tangents, or this sigmoidal function as a
probability then it's although the network is deterministic I can think about it as a stochastic map okay is this clear and then eventually from this internal representation inside the box which in the case of neural networks if deep neural networks have many many layers and actually not one representation is a cascade of sequence of representations and which are related to each other I call this the decoder which is how I actually calculate the output from the representation so this is the picture I have in mind when I talk about representation learning and towards the end of this lecture I'm going to give a very strong statement about the nature of the problem in terms of the properties of this encoder in the code but actually I'm going to use so given a representation I'm not going to use the actual decoder of the network I'm going to use the optimal decoder the one I could get if I was a god or had access to all the data this is what I call this is the base optimal decoder so think about it for a second if you have any representation there is a map not from y but back not to y hat but back to y which is the best thing I can do now one thing just about notation when I write things with arrows like this I actually use the graphical model representation so this is the Markov chain which means x and y are related to each other in such a way that given x I can calculate y so maybe this goes in both directions actually in this direction but there's a Markov dependency here so it's p y given x which really matters and then if I have distribution of x I know also p of x so I have the joint distribution so it's p y given x which tells me how the label is determined given the pattern and x is something which I'm going to call the pattern the input pattern and I'm going to think about okay the map from the pattern to the representation and then from the representation what would be the best possible inference of the label given x and this is what I call the optimal 
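As an aside on this "deterministic network read as a stochastic map" idea: here is a minimal sketch, assuming a single toy layer with made-up weights, in which the sigmoid outputs are reinterpreted as probabilities P(t_j = 1 | x) and a binary representation is sampled from them. Everything here (sizes, weights, names) is invented for illustration, not taken from any particular network.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical encoder weights: 10-dimensional input -> 4 hidden units.
W = rng.normal(size=(4, 10))

def stochastic_encoder(x):
    # The deterministic sigmoid activations, read as P(t_j = 1 | x)...
    p = sigmoid(W @ x)
    # ...from which we sample a binary representation t: a stochastic,
    # many-to-one map p(t|x) instead of a fixed hidden vector.
    t = (rng.random(p.shape) < p).astype(int)
    return p, t

x = rng.normal(size=10)
p, t = stochastic_encoder(x)
print(p.round(2), t)  # the same x can yield different t's across calls
```

Calling `stochastic_encoder(x)` repeatedly with the same `x` gives different binary codes `t`, which is exactly the sense in which the deterministic layer defines a conditional distribution p(t|x).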
decoder, or the Bayes-optimal decoder — and I'll come back to this explicitly. Eventually I'm going to state a theorem which tells us: look, what really matters is not even those encoders and decoders, but one number each — the mutual information of the encoder and the mutual information of the optimal decoder — and these are going to tell us the whole story. So this is where I'm going.
Now, usually when we think about learning, the issue is generalization. So I'm going to talk about PAC bounds — probably approximately correct bounds — as they appear in the original formulation due to Valiant in '84; actually they go back in statistics many years before that — you can trace them to Vapnik in '71, and even further. The PAC bounds talk about something we call the generalization gap. Imagine I have a hypothesis class, written H: a class of functions, which can be, for example, polynomials of some degree, or some bounded family of functions, or functions on a sphere that can be approximated by a finite number of spherical harmonics — whatever you like, as long as I can bound its complexity in a very strict sense that I'll define in a second. So H is a class of functions from x to y; if y is just a bit, these are Boolean functions of x. Now, for each of my examples I can look at, say, the distance, or the squared distance, between my function's prediction and the true label. It doesn't have to be the squared distance — it can be the absolute value, or something else; we like squared errors because they're easy to differentiate, but we don't have to use them. Whatever it is, it measures how closely my h approximates the label, and I'm going to call it the error of h on x_i — let's write it e(h, x_i); of course it's a function of both x_i and y_i.
Now imagine I have m examples which are randomly drawn — assume they are independent and identically distributed — from this joint distribution p(x, y), which is out there, part of the world. What we really care about — at least what learning theory cares about — is how well the real error, the expected error, can be approximated by the empirical error, the error on the sample. Let me write it carefully. With independent samples I can sum the errors on my sample: I have a sample of size m, let's write it x^m, an iid sample of patterns and labels, and this is the sum of the errors. If I want the average, the empirical mean is 1/m times that sum — the mean error on my sample. What I really care about is the difference between this and the expected error — the expectation of this error with respect to all possible x's, which I'll denote E[e(h, x)]. (I can see you're not all with me yet, which worries me a little.) So I want to bound the absolute difference between the empirical error — sometimes called the training error, the error on the sample — and this is
the expected error, which is what I would get if I had all the data: the expectation of the empirical error. Is this clear? This is what I really care about, because I want to find an h with small error on average over all the data. (Sometimes I also want to control the variance, but that's a different story.) This difference between the empirical error and the expected error is called, in the learning literature, the generalization gap — let me write "gen gap" because I'm out of space — and sometimes it's actually called the generalization error. I know many of you know this, but you don't respond to my questions, so I have to go slowly.
So what do I know about this difference? Look, the story is the following. Say this axis is the error and this axis is h. In general H is a high-dimensional space, but for lack of more axes on the board I'm drawing it as one line. Say the best function is somewhere here — call it h*. It may or may not be in my hypothesis class; let's say h* is the closest possible function in my class to the data, the best approximator. Remember my points from before: none of those polynomials actually went through all the points, and I didn't care — I wanted a good approximation, say in the squared sense. This is what you do all the time. Now, if I plot the expected error as a function of h, it usually looks like something complicated — this is the error averaged over all possible points, including the ones I don't see, the out-of-training data; that's very important. If h* really is the best approximation in my class, it has the minimal expected error. By the way, there may be many h's like this: in complicated models like deep networks there are infinitely many good models, all more or less equally good, and we know that's part of the problem — but for a second, think about just one.
So the problem looks solved: just minimize the expected error and find this point. But the problem is that I don't have this function. Why not? Because I don't have all the data, only a sample. So the only thing I can do is approximate this function, and I approximate it by the empirical error, because that's something I know: for every h I can calculate the empirical error. Let me use another color. Say the empirical curve looks like this — actually it can be much noisier. This is the empirical error, (1/m) Σᵢ e(h, xᵢ). Why is it noisier? Because it's computed on a random sample; maybe I get zero error somewhere, but that doesn't mean that h is the best one. So what would I like to have? This is the fundamental move of classical machine learning: I minimize the empirical error — this is called empirical risk minimization, ERM — and I want a guarantee that by doing so I'm not too far from minimizing the expected error. If the two curves are never too far apart, then when I land at the empirical minimum instead of the true minimum, they are close enough. For this I need something we sometimes call uniform convergence: I need these two functions to be uniformly close — uniformly meaning that for all h's I can bound the difference, at least in probability. That is my goal, the goal of Valiant: bound the difference between the empirical error and the expected error, uniformly over all h's. If it's uniformly bounded, the gap cannot be too large anywhere, and minimizing the empirical error will not take me far from the actual best function. That's the main idea — if you're confused, I'm in trouble. Yes: I optimize over h, and I want the bound to be uniform. It's not enough for a single h — that's what we call a uniform bound — and then I optimize the blue curve. You're absolutely right, you're just ahead of me; let me go slowly. This is precisely the point: I want a uniform bound because I want the minimization of the blue function to be close to the minimization of the white function, and they are not the same function. It would be a disaster, for example, if this were the white function and this the blue function, completely uncontrolled — I could land at this point, or at this one, and that would not give me good generalization. But if I can put the blue curve inside a bounded envelope around the white one, uniformly for all h's, then I'm fine. That is a uniform convergence bound. This is the spirit of learning theory; this idea has dominated learning theory mostly until today. First I want you to understand it, and then I'm going to ruin it — but before I ruin it, I want to convince you it's a good idea. So this idea is called empirical risk minimization, or empirical error
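The two-curves picture can be simulated directly. The following is a toy sketch — the threshold hypothesis class, the noise level, and the sample sizes are all invented for illustration — where a huge held-out sample stands in for the expected error, and the maximum distance between the empirical and "true" curves plays the role of the uniform envelope:

```python
import numpy as np

rng = np.random.default_rng(1)

# A tiny hypothesis class: thresholds h_c(x) = 1[x > c], finitely many c's.
thresholds = np.linspace(-2, 2, 41)

def err(c, x, y):
    """0-1 error of the threshold-c hypothesis on a labeled sample."""
    return np.mean((x > c).astype(int) != y)

# Made-up "world": y = 1[x > 0.3] with 10% label noise.
def draw(m):
    x = rng.normal(size=m)
    y = ((x > 0.3) ^ (rng.random(m) < 0.1)).astype(int)
    return x, y

x_tr, y_tr = draw(50)          # small training sample -> noisy blue curve
x_big, y_big = draw(200_000)   # huge sample -> proxy for the white curve

emp = np.array([err(c, x_tr, y_tr) for c in thresholds])
true = np.array([err(c, x_big, y_big) for c in thresholds])

c_erm = thresholds[np.argmin(emp)]   # the empirical risk minimizer
gap = np.max(np.abs(emp - true))     # estimate of the uniform envelope width
print(c_erm, gap)
```

The standard two-sided argument then says the ERM hypothesis is near-optimal: its true error is at most the best true error in the class plus twice the uniform gap.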
minimization, ERM. ERM tells me: look, if you minimize the training error and you have a uniform bound on the difference between the sample error and the expected error, then you're okay, because I can bound the actual error by the empirical error plus the width of this tube, this envelope. You're getting ahead of me again — let me go a little further and then take questions.
No — this should hold for any m; I want m-dependent bounds. So the question becomes a classical statistical one: I want to bound the probability that this gap is greater than some epsilon. If that probability is small, I have my epsilon tube and I'm fine. Now — this is something Bert mentioned the other day — this is an empirical average, and that is its expectation: the sum is just m times the average, and the expectation of the empirical average is exactly the expected error. So why can I bound the deviation? Right — it's just like saying that with high probability the gap is smaller than epsilon; then I minimize the empirical error and I'm done. But first I need the guarantee: bound the generalization gap, and then of course take the optimum over all h's if I can. Yes — this is the probability distribution over all patterns, and the expectation is with respect to the x's. Of course we don't have all the patterns; we have a sample of m patterns. The second term is something we don't have, but I can still bound the difference, under some conditions.
So, in general, for one fixed h — any single h — what do I know about the probability that the empirical mean is close to its expectation? This is the standard convergence of empirical means, which has been mentioned many times: it's a sum of iid numbers — independent because the x's are independent — and therefore it converges to a Gaussian distribution. The sum has a width that scales like √m, but after dividing by m, the empirical mean has width like 1/√m. So the distribution of this empirical mean looks like a Gaussian around the true mean — over samples of size m — with width going like 1/√m. That's very nice, because it lets me use something called the Chernoff bound: if I want the probability that the difference exceeds epsilon, I need to bound the tails of this Gaussian, and for any epsilon, taking m large enough, the Gaussian concentrates around the mean. So the Chernoff bound — I'm sweeping some assumptions under the rug here, for instance that the error is bounded — tells me this probability is less than something like exp(−2mε²). How many of you are surprised by that? This is one version of the Chernoff bound; the coefficients vary depending on exactly what is bounded — one-sided or two-sided, and how — but this is the essence of it. By the way, the fact that this reminds you of the exponent of a Gaussian is no accident: ε² plays the role of (x − μ)² in the Gaussian exponent, and the factor of m comes from the variance shrinking like 1/m. So it's just the exponent of a Gaussian distribution at that point, and I want it small enough; strictly it's a bound on the tail — for a Gaussian, the tail is bounded by the density at that point up to constants. This is called the Chernoff bound. People think they know it, but let me say it explicitly, because it is the fundamental bound here, and it's all very simple: I have an empirical mean, I want it close to the true mean, and as m grows the width shrinks like 1/√m. Very nice, because the bound is independent of everything else — it depends only on m and epsilon. Now I want this probability — of being too far from the mean — to be small, so I'll make it smaller than some delta. But that's not good enough. Why? Because this was for one h. What about the other h's in my class? I need to bound all of them, so I need something slightly different: the probability that there exists an h in my hypothesis class whose empirical error differs from its expected error by more than epsilon, and I want that probability smaller than delta. That gives me the uniform convergence I want: no h in my class is bad — uncontrolled, with a larger generalization gap than I allow. How do I do this? There's a very simple trick. (Yes, there's an absolute value here; this is just the condition on the gap.) This is the probability that h₁ is bad, or h₂ is bad, and so on — an "or" over many, many events. So this is bounded
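This Chernoff/Hoeffding-type tail bound is easy to check numerically. A quick sanity-check sketch — the Bernoulli error rate 0.3, the sample size m, and epsilon are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Check  P(|mean_m - mu| > eps) <= 2 * exp(-2 * m * eps^2)
# for bounded iid "errors" in [0, 1] (here Bernoulli(0.3) draws).
m, eps, trials = 200, 0.1, 20_000
mu = 0.3

samples = rng.random((trials, m)) < mu    # trials x m Bernoulli(mu) draws
means = samples.mean(axis=1)              # one empirical mean per trial
freq = np.mean(np.abs(means - mu) > eps)  # observed tail probability
bound = 2 * np.exp(-2 * m * eps**2)       # the Hoeffding-style bound

print(freq, bound)  # the observed frequency sits below the bound
```

With these numbers the bound is 2·e⁻⁴ ≈ 0.037, while the observed tail frequency is far smaller — the bound is loose but, crucially, distribution-independent: it depends only on m and epsilon.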
by something called the union bound. In principle, if H is finite, this is bounded by the cardinality of H times the probability that a single h is bad — so |H| · exp(−2mε²) — and I want this smaller than delta. Now, what is this cardinality of H? That's the tricky part. (Yes — you're right, absolutely: such that the gap is larger than epsilon; and this is for a given sample, so the probability is over samples. I have to be careful here: for a given sample the gap is just a number, and I take the probability, over samples, that there exists an h for which it exceeds epsilon.) So that's the whole trick. It's called the cardinality bound, and of course it makes sense as long as my class of functions is finite. What I get from it is very nice — just take logs. I want the probability that there exists an h which is epsilon-bad on my sample to be small with high probability — the probability is again over samples, because my samples are random and can be good or bad; most samples are good, especially large ones. Taking logs, you immediately see something like this: ε² ≤ (log|H| + log(1/δ)) / (2m). This is simple algebra — take the logs and rearrange; I won't do it here. And this is a very interesting bound — this is why people like it — because it tells me the only things to worry about are log|H| and this log(1/δ) term, which we call the confidence: it guarantees my sample is not too atypical. Delta is essentially the allowed probability of a bad sample — which can always happen, by the way — but when I go to very large data, very large problems, this probability is
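Plugging numbers into the rearranged finite-class bound makes the log|H| dependence concrete. A small sketch — the class size 10⁶ and δ = 0.05 are made-up values for illustration:

```python
import math

# The finite-class PAC bound rearranged: with probability >= 1 - delta,
#   gap <= sqrt((ln|H| + ln(1/delta)) / (2m)).
def pac_gap(card_H, delta, m):
    return math.sqrt((math.log(card_H) + math.log(1.0 / delta)) / (2 * m))

# E.g. |H| = 10^6 hypotheses at 95% confidence: the guaranteed gap
# shrinks like 1/sqrt(m) as the sample grows.
for m in (1_000, 10_000, 100_000):
    print(m, round(pac_gap(10**6, 0.05, m), 4))
```

Note how gently the bound depends on |H|: a million hypotheses contribute only ln(10⁶) ≈ 13.8 to the numerator, which is why the log-cardinality, not the cardinality itself, is the right complexity measure here.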
going to be negligible: the probability that I see a million images of myself, none of them typical, is very small — I can forget about it. Now I want to talk a little about this bound. Is this clear? I skipped one line here: this was a union bound, so I approximate the probability of a union of events by the sum of the probabilities — that's an equality if the events are disjoint, and in general these events are not disjoint, so it's a bound. Each term is bounded by exp(−2mε²), because the Chernoff bound holds for any h, so the sum is simply |H| · exp(−2mε²), and I want that smaller than delta. So this is it — this is the proof of the theorem.
There's one problem. Somebody asked me what the cardinality is: essentially, it's the number of functions in the class. But obviously, if I talk about polynomials — even with one or two real coefficients — it's not a finite class. So I need something else, and this is called covering, or quantization, of the class. If H is finite, the bound is very nice and well behaved — I'll show you later that it's nearly useless even then — but for an infinite class we use a very important trick called an epsilon-cover of H. What I mean is this: imagine that H is a compact manifold, a compact space in Rᵐ — and I don't want to get into too much mathematical detail here. (Sorry, I couldn't hear — yes: if H is infinite, this bound as it stands is useless. I didn't say it explicitly, but it was assumed that this sum is finite; if it's infinite, I'd have log of infinity, which tells me nothing. Notice, by the way, that the bound becomes meaningful only when m is larger than log|H| — forget about delta; it's log|H| over m that dominates my epsilon, and I want that small, otherwise the bound makes no sense.)
So if H is infinite, the bound is not useful as is, but there's a very nice trick: if H is compact in the mathematical sense, then for any epsilon I can find a finite cover. Say H is a two-dimensional manifold — this is supposed to be a piece of one — compactness means it has a finite cover for any epsilon. So I cover it with balls of size epsilon, in whatever metric, as long as it's more or less a metric; within each ball, the difference between the errors is bounded by epsilon. The natural metric here is the probability of disagreement of two hypotheses: d(h₁, h₂) is the probability, under my data distribution, of what we call the symmetric difference of h₁ and h₂ — the set of points where the two disagree. For example, if my hypotheses are regions in the plane — one inside, zero outside — the symmetric difference is exactly the region where they disagree. The probability of disagreement is a very natural metric. In principle such a metric exists, and I don't even need its details; I can just ask how many epsilon-balls I need to cover my space. Now, many of you may have heard of the Hausdorff dimension, the dimensionality of a manifold. In general, the number of balls of size epsilon needed to cover a space scales like (1/ε) to some power d. In one dimension — say this is my manifold — I need essentially 1/ε balls; in two dimensions, as I hope you can imagine, 1/ε²; and so on. This d is the dimension of the manifold in which H lives. It can be the dimension of the parameters, for example, or of some other smooth embedding of my hypotheses — call it the topological dimension if you like. If such a d exists as epsilon gets small — so d is the limit, as ε → 0, of log N(ε) / log(1/ε), and this limit exists in some sense — then this is a finite-dimensional manifold; at least it has a finite Hausdorff dimension, and that's enough for us. In other contexts this kind of dimension is called the VC dimension — it's very closely related; the VC dimension is usually introduced in a much more complicated combinatorial way, but d here can be the VC dimension, or the Hausdorff dimension, and so on — there are many similar notions, all equivalent up to constants in the exponent. Yes — it is related. So what I actually want is to replace the cardinality |H| by the cardinality of an ε/2-cover: if that's finite, it's enough to pick one of the elements of the cover as your function. The error is then bounded by ε/2 from the bound, plus another ε/2 that I lose from the cover, so altogether I get an epsilon approximation — that's good enough. So I just cover
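The covering-number definition of dimension can be checked on the simplest possible example. A minimal sketch — using the unit square with a grid of ε-boxes as a crude stand-in for ε-balls (the geometric constants don't matter in the limit):

```python
import math

# Recover  d = lim_{eps->0} log N(eps) / log(1/eps)  for the unit square.
def n_cover_unit_square(eps):
    k = math.ceil(1.0 / eps)  # boxes needed per side
    return k * k              # total eps-boxes covering [0,1]^2

ds = []
for eps in (0.1, 0.01, 0.001):
    d_hat = math.log(n_cover_unit_square(eps)) / math.log(1.0 / eps)
    ds.append(d_hat)
    print(eps, round(d_hat, 3))  # the estimate sits at 2, the dimension
```

The same ratio computed for a curve would approach 1, and for a d-dimensional parameter manifold it approaches d — which is exactly the quantity that ends up replacing log|H| in the bound.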
my space by balls of radius ε/2 and take the closest element of the cover as my hypothesis — this costs me at most an additional ε/2 of error, so I get an epsilon approximation. That's the whole trick: I quantize the hypothesis class into a finite cover, and then I can use the cardinality bound. The cardinality is now a function of epsilon: say it behaves like (2/ε)^d, with some geometric constants I don't care about because eventually m is very large. And this is nice — remember, it's the log of the cardinality that enters, and the log of (2/ε)^d is d·log(2/ε). So the bound becomes something like ε² ≲ (d·log(2/ε) + log(1/δ)) / (2m): the d comes down from the exponent. As you see, I really don't care about constants, because we're already thinking about very large problems — if you want them, work them out. And this is nice because it tells you that m, the number of examples, should scale like the dimensionality of my class: the bound is essentially d/m. As long as m is smaller than d, it's not a useful bound; once m is larger than d, it becomes a useful
bound. So this is the nature of learning theory, of the PAC bounds. How much time do I have? Yes — it tells me that if m is larger than roughly d/ε², I'm fine; you have to work out the logarithmic corrections, but they're not important. This took me way too long, but it matters, because now I'm going to argue that when you think about neural networks, the dimensionality of the parameter class — with all these weights — is orders of magnitude above the number of examples. This is a fact. So we either need to refine this bound or think about different things. Notice, by the way, that the hypothesis class is extremely important here while the distribution is not: everything is distribution-independent — I don't care what the distribution of the data is — but I care very much about the dimensionality of the hypothesis class, about what kind of functions I'm looking for. And now I'm going to give this type of bound a ninety-degree twist. So this is PAC theory; it's actually very elegant, and that's why it has dominated the story so far. Any questions? Yes — it's the dimensionality of the space of functions, but if you think of the weights as the parameters of the function space, the possible networks form a space whose dimension is bounded, at least, by the number of weights. That's exactly right — but you don't see the consequence yet: if I replace d simply by the number of parameters in the network, the number of weights between all these layers, it's going to be way too high, and these bounds are useless — they are in the stratosphere, as I usually say, way above one. So we need a new argument, and that's exactly what I'm going to do now. But before that, I need some concepts about concentration; I need to clarify a few things. This talk is mainly introductory, and I see already
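The "stratosphere" point can be made numerical. A back-of-envelope sketch — ignoring all constants, with made-up dimensions, reading the covered-class bound as a sample-complexity requirement m ≳ d/ε² up to logs:

```python
import math

# Rough sample-complexity reading of the bound
#   eps^2 ~ (d * log(2/eps) + log(1/delta)) / (2m),
# solved for m: the examples needed scale like d / eps^2 (up to logs).
def m_needed(d, eps, delta=0.05):
    return math.ceil((d * math.log(2.0 / eps) + math.log(1.0 / delta))
                     / (2 * eps**2))

# A modest model vs. an over-parameterized network (hypothetical sizes)
# at the same target accuracy eps = 0.05:
print(m_needed(d=100, eps=0.05))    # tens of thousands of examples
print(m_needed(d=10**7, eps=0.05))  # billions of examples
```

With d in the tens of millions — a perfectly ordinary weight count for a deep network — the required m dwarfs any realistic dataset, which is precisely why the parameter-counting version of these bounds says nothing useful about deep learning.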
that I'm not going to get too far by 6:30, but let me try. So when I say PAC bounds, that's what I mean. And notice that the main idea in this bound was this concentration of empirical means — the Chernoff bound — which is something we use again and again. Yes, please — I can't hear you. The h function? Think about the network: what are the parameters? All those weights, all the connections from one layer to the next — these are the adjustable parameters. So in principle the number of parameters of these models is bounded by the number of weights; in practice they behave as if they had a much lower dimension. Let's try to see why, and try to estimate this dimensionality. The only mathematics I can use is the mathematics of large systems, of large-scale learning — just a second, I'm coming back to you — this is the topic of many lectures in a course I give in Jerusalem, and I'm going to summarize it very quickly. Yes — so again, this is an interesting trick: we are not thinking about x at all. The only place x enters is the distance measure — the probability that two h's disagree depends on x and on the distribution of x — but once I define this distance between two hypotheses as the probability of their symmetric difference, I don't need anything else about x; it's a distance on the h's. This is what we call covering, or quantization, of the hypothesis class. Later I'm going to shift to a different kind of covering — covering of the input class — and your question is actually very relevant to that. Yes, another question — I try to answer every question I get. No, I don't have to assume that, because the central limit theorem tells me this is going to be Gaussian in the limit; it's not an assumption, it's a mathematical result, as long as the errors are bounded and independent in some sense — well behaved, with finite variance; that's enough for me. Yes — so if you're
talking about simple functions like polynomials then this d this dimension is going to be essentially the number of parameters plus minus one or something yes they are using a way over complete function but it's so over complete that it doesn't make any sense anymore yeah that's essentially what's happening now okay so I guess you are more or less happy and I want to talk about so we need some mechanisms some mathematics that allow us to control the behavior of very large systems it's not new it's called a statistical physics it's called an engineering information theory these are all theories which are asymptotic in some sense and I just want to give you some intuition about it so imagine that my xi's are distributed iid according to some distribution p of x just to make it simple okay and now I'm having an m sample an iid sample miid sample again as before and I look at the probability of all the samples together okay so what do you know about it if they're iid what is that now I'm trying to get to the point where everyone answers together okay so this is a product okay this is the one of the actions of probability the probability of independent event factorizes multiplied okay so this is a product over write it here okay that's a very fundamental thing by the way it will be true in a much more general setting but what do I know about product of iid numbers nothing no but what I do know is that sums behave very nicely I know that products are so first of all this should be all known zero so I want to assume that this is strictly larger than zero otherwise if I have one zero here it's going to verify all my product that's not very nice so of course the assumption is that if you saw x i it doesn't have a zero probability because otherwise but you know when you talk about the continuous measures that come tricky so essentially this is the assumption here okay so if they're all positive and I want to actually turn it into sum what should I do yeah good okay I want to 
If I take the log, this product turns into a sum: log p(x_1, ..., x_m) = sum_i log p(x_i). But now, what do I know about sums of independent things? I just said it: they concentrate. And they concentrate around what? Around the mean. So this is going to look, asymptotically, like m times the average of log p(x) — this is the advantage of physicists, averages are very easy to write. So it actually makes perfect sense to take 1/m here, and since log p(x) is a negative number (probabilities are at most one), I put a minus sign in front. For the same reasons we discussed for the concentration of independent sums, the quantity −(1/m) sum_i log p(x_i) is going to concentrate. I have to be a little more careful here, because these are no longer bounded numbers and I have to do this averaging carefully, but let's say that I can control that. So this is very nice, because it tells me that the log of the probability of a large sequence of events is going to concentrate around this mean of minus log, with fluctuations that again go like 1/sqrt(m) under similar assumptions — I have to control the magnitude of these log-probabilities, but that's not an issue. And again it's going to be a Gaussian distribution, for the same reason. So in the large-m limit, this quantity is going to be centered around a particular average. Now, what is this particular average? It is simply the expectation of −log p(x) over all possible x's. How do you write it otherwise? It's minus the sum over all x (or the integral over all x, I don't care) of p(x) log p(x). This is what it's going to concentrate around. But you probably know this expression — what is it? This is what we call the entropy of X. So the entropy of X, H(X) = −sum_x p(x) log p(x), is a very useful thing; this is a definition, and it's called the entropy, or the Shannon entropy. So the entropy of X is going to
dominate the probability of a very long sequence, once the sequence is very long. Okay, so that's very important, because essentially it's telling us something which in information theory we call the asymptotic equipartition property — never mind, it's a complicated name for something very simple. It's telling us that in the large-m limit, if I put some epsilon around the entropy, the probability of finding sequences whose normalized log-probability is farther than epsilon away is going to go to zero with m. So put an epsilon on the tails, exactly as I did before, and this is going to concentrate very nicely. Actually I can say it very precisely: allow me to write x^m for the whole sample (x_1, ..., x_m), just to save space. Then the probability that |−(1/m) log p(x^m) − H(X)| is larger than epsilon is bounded by something exponentially small in m — something like e^{−m epsilon^2}, maybe with other constants, but I don't care. This is again the Chernoff bound, or the central limit theorem. So that's very nice: it's telling me that for large enough m, this quantity concentrates a lot. Okay, so this is a very interesting property of this number, the entropy, which we can actually calculate explicitly if we know the distribution: this empirical log-likelihood, or log-probability, is attracted very sharply to this particular number. And now I can restrict myself only to what I call typical patterns, typical sequences. This is important. In some sense I want to make this epsilon sufficiently small, such that I don't care anymore about bad sequences which are out there — I call them non-typical. So essentially I can define the set of all sequences of size m such that |−(1/m) log p(x^m) − H(X)|
is less than epsilon. Okay, so again it's a magnitude condition on the sequence x^m. (Sorry — the monitor, you can take it wherever you want; I usually have a board that I can lift up.) So this is what I call the epsilon-typical patterns. Why epsilon? Because there's an epsilon here. And what is really nice is that when m goes to infinity, I can shrink epsilon — very much like with the Chernoff bound again. So for large m, for large-scale problems, most of my patterns are going to be typical. Actually there's something even nicer here: it tells me that the probability of any typical pattern, whatever it is, is essentially e^{−mH}. So all the sequences in this class have essentially the same probability, because for all of them the normalized log-probability is within plus or minus epsilon of H — one has to be careful, it's e^{−mH} up to a factor e^{±m epsilon} — so all of them are essentially equal. Now, I'm going to apply this type of idea to neural networks. I'm going to do it very quickly now, because I'm out of time: I'm going to use this entropy, this typicality, to characterize the layers of a neural network. When I say large-scale problem, I'm going to restrict myself to typical x's. Typical x's means: out of all the possible images of 10 megapixels, I'm going to ignore everything that you would ignore — the non-typical ones. If I have a very distorted image of me and you don't recognize it, I don't care whether the network recognizes it; if it's a non-typical image, forget it. So I'm going to do everything that we did so far, but only on typical patterns, and I said that when m is very large this is a good assumption, because I can bound both epsilon and the non-typical mass very easily. And the same line of ideas, exactly the same idea, carries over to joint distributions of two variables x and y, and to conditional distributions. So this is going to be a very useful tool, and I'm going to use it in order to bound the cardinality of all sorts of things.
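The concentration behind typicality is easy to see numerically. Here is a minimal sketch with a made-up four-symbol source — the distribution and the sample size are arbitrary illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small discrete source p(x) over four symbols (a made-up example):
# its Shannon entropy is H(X) = 1.75 bits.
p = np.array([0.5, 0.25, 0.125, 0.125])
H = -np.sum(p * np.log2(p))

# Draw an i.i.d. sample of size m and compute -(1/m) log2 p(x^m),
# the normalized log-probability of the whole sequence.
m = 100_000
sample = rng.choice(len(p), size=m, p=p)
nll = -np.log2(p[sample]).mean()

print(H, nll)  # the empirical value concentrates around the entropy
```

Increasing m tightens the concentration like 1/sqrt(m), exactly as the Chernoff-bound argument above predicts.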
Okay, so let me continue. I need a little more information theory — I was hoping to get through all of it, but I'm out of time. The first notion is the KL divergence between two distributions. Usually I don't even spend time on it, but here I really want you to understand it, and I see that it's actually working, so I want to motivate this KL divergence a little more. You've heard about it already, I'm sure, but it's an extremely important quantity in this asymptotic setting. For this I'm going to take a very simple example, which I'm then going to generalize into something called Sanov's theorem. Imagine that I have a binomial distribution — Bernoulli trials, coin flips — and I'm asking: what is the probability of seeing m heads out of n trials? Let's say that p(head) = theta and p(tail) = 1 − theta, with theta between zero and one. So the probability is (n choose m) theta^m (1 − theta)^(n−m): I see m heads, each with probability theta, and n − m tails, each with probability 1 − theta, and the binomial factor counts the arrangements — I assume you know where it comes from. Now, what is really interesting — it's a very simple exercise, but I want you to do it — is that the log of this behaves like log(n choose m) + m log theta + (n − m) log(1 − theta), and since the binomial coefficient is n!/(m!(n−m)!), all I need is the behavior of log n!. What is log n!, approximately? This is the Stirling approximation: log n! ≈ n log n − n. If you don't believe it, remember that integrating 1/x gives you log; so integrate log x and you get x log x − x (this is a natural log now). Using this, the whole expression is equal, approximately — and the
only approximation here is the Stirling approximation, which is actually very good — on the order of 10% error for the smallest n and about 1% by n = 8 — even before worrying about the prefactors, the sqrt(2 pi n) (n/e)^n business that you get from doing the integration carefully. So if you just plug this in — and I'm not going to elaborate here, this is an exercise, I want you to check it yourself — you get that the log-probability is approximately equal to minus n times the KL divergence between the empirical distribution, which is the binary distribution (m/n, (n−m)/n), and (theta, 1 − theta). So that's actually very nice; the only approximation is Stirling. Now, what is it telling us? Something which I'm going to generalize and use later on in several places. It's telling us that the probability of seeing an empirical sample far away from the expected distribution — the expected distribution here is (theta, 1 − theta), that's what I expect, while what I actually see is the empirical distribution (m/n, (n−m)/n) — is exponentially dominated by the KL divergence, and the exponent has this prefactor n, the size of my data. This is another concentration result: the probability of seeing an empirical distribution far from the expected one decays exponentially with the sample size, where the exponent is governed by this distance between the empirical distribution and the expected distribution. Some of you may recognize an approximation to this which is known as the chi-square statistic — essentially the sum of (observed − expected)^2 / expected, or something like this — which is just a quadratic approximation to the KL divergence. So usually, when we want to see how likely an empirical data set is to have come from a particular hypothesis, we look at the chi-square; that's what most people do, instead of the KL divergence.
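The claim that log P is approximately −n times a KL divergence can be checked directly. A sketch with arbitrary invented numbers, computing the binomial exactly via log-gamma rather than Stirling:

```python
import math

def log_binom_pmf(n, m, theta):
    """Exact log P(m heads out of n flips) using log-gamma."""
    log_coeff = math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
    return log_coeff + m * math.log(theta) + (n - m) * math.log(1 - theta)

def kl_bernoulli(q, theta):
    """KL divergence (in nats) between Bernoulli(q) and Bernoulli(theta)."""
    return q * math.log(q / theta) + (1 - q) * math.log((1 - q) / (1 - theta))

# 6,000 heads out of 10,000 flips of a fair coin: the empirical
# frequency q = 0.6 is far from theta = 0.5.
n, m, theta = 10_000, 6_000, 0.5
q = m / n
exact = log_binom_pmf(n, m, theta)   # exact log-probability
sanov = -n * kl_bernoulli(q, theta)  # the -n * KL exponent
print(exact, sanov)  # agree up to the O(log n) Stirling correction
```

The two numbers differ only by the sub-leading O(log n) term that Stirling's prefactor accounts for; the exponential rate is entirely the KL divergence.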
Now, this is a very special case of something which I want you all to remember, called Sanov's theorem. By the way, suppose I didn't know theta; say theta could come from some domain of possible thetas — anything between 0.4 and 0.6, say — and I ask what is the likelihood that the data came from some theta in that class. Then, for large n, this probability is going to be dominated by the smallest exponent, and everything else is going to be negligible. So even if theta belongs to some subset Theta of all possible thetas — even, by the way, if it is not a convex set, which is what one usually assumes, but I don't care about that at this point — the probability is going to be dominated by the point in my class which is closest to the empirical distribution, because that gives the smallest exponent. So there's going to be some theta*, where theta* is the minimizer, over all possible thetas in my class of parameters, of the KL divergence between the empirical distribution — let's just call it p_empirical — and p_theta. This is essentially Sanov's theorem: if you have a class of models, a hypothesis class, everything is dominated, in the limit of large n, by the point in the hypothesis class closest to my data, and everything else is not important. And — to state it a little more carefully — this is always true for empirical distributions, not just Bernoulli: for any general distribution and any general hypothesis class, the probability that an empirical distribution falls in the class is exponentially dominated by this number, the KL divergence, which is non-negative. It also has many other interpretations, like code-length difference and so on.
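Sanov's picture — the dominating point is the class member closest in KL to the empirical distribution — in a toy Bernoulli version. The class [0.40, 0.60] and the observed frequency 0.62 are invented for illustration:

```python
import math

def kl_bernoulli(q, theta):
    """KL divergence (in nats) between Bernoulli(q) and Bernoulli(theta)."""
    return q * math.log(q / theta) + (1 - q) * math.log((1 - q) / (1 - theta))

# Hypothetical hypothesis class: theta anywhere in [0.40, 0.60],
# discretized on a grid for this sketch.
thetas = [t / 100 for t in range(40, 61)]

# Empirical head frequency observed in the data (outside the class).
q = 0.62

# Sanov: for large n, the probability of the class is dominated by the
# member closest (in KL) to the empirical distribution.
theta_star = min(thetas, key=lambda t: kl_bernoulli(q, t))
print(theta_star)  # the boundary point nearest the data
```

Since 0.62 lies outside the class, the minimizer sits on the boundary at 0.6 — the point whose exponent −n·KL decays slowest and therefore dominates.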
I don't care about those other interpretations at this point; what I care about is Sanov's theorem, and I'm going to use it now. So the KL divergence is not just a measure of similarity or distance: it really governs the asymptotic probability of an empirical distribution being far away from my class of parametric distributions, which is what I care about. And you see immediately the relationship between KL and the question we asked before, because we do care about the distance between our empirical data and the class of possible functions, and what I'm going to argue is that it's going to be dominated by the minimum over this class. Now, there's a very important special case of the KL divergence, known as the mutual information between two variables. If you have a joint distribution of x and y — like my input and output — then the KL divergence between the joint distribution and the product of the marginals tells you something about dependencies: I(X;Y) = KL(p(x,y) || p(x)p(y)). This is zero exactly when they're independent, when the product is the same as the joint. I can write it in various ways, but the important one for me now is that it's the entropy of X minus what we call the conditional entropy of X given Y: I(X;Y) = H(X) − H(X|Y), where H(X|Y) is the entropy of X conditioned on y, averaged over all y's. So this is actually very interesting: the mutual information is how much uncertainty about X I lose when I know Y — and, symmetrically, how much uncertainty about Y is removed when I know X. I'm going to use this; mutual information is very important for all sorts of reasons, as we're going to see. (Today was really very slow, but never mind — I have four more hours, I can do a lot more.) Now, there are two properties of the mutual information which I want you to remember. One of them is called the data processing inequality: it holds for any Markov chain, which means y can be computed from x, z can be computed from y, and so on.
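The two expressions for the mutual information — the KL divergence between the joint and the product of marginals, and H(X) − H(X|Y) — can be verified on a small made-up joint distribution:

```python
import numpy as np

# A made-up 2x2 joint distribution p(x, y) with dependent variables.
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1)
py = pxy.sum(axis=0)

def entropy(p):
    """Shannon entropy in bits of a distribution given as an array."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# I(X;Y) as the KL divergence between the joint and the product of marginals:
i_kl = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

# ... and as H(X) - H(X|Y), using H(X|Y) = H(X,Y) - H(Y):
i_cond = entropy(px) - (entropy(pxy.ravel()) - entropy(py))

print(i_kl, i_cond)  # the two expressions give the same number
```

Making the off-diagonal entries larger pushes the joint toward the product of marginals, and both expressions go to zero together, as independence requires.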
These arrows are just like the arrows of a neural network, the layers of a neural network, and along them the information cannot increase: for a Markov chain X → Y → Z, the information I(X;Z) cannot be larger than I(X;Y), and it cannot be larger than I(Y;Z) either. This is known as the data processing inequality, and it's actually very intuitive: you can't gain information just by processing. So I can't gain information about the input X when I move from one layer to the next, and I can't gain information about the label Y either, because Y was on the other side of X in my chain, if you remember. Another very important thing, which I'm going to discuss tomorrow morning in detail: this mutual information is invariant — it's a special case of data processing — under any reversible transformation. Any permutation, any encryption, whatever you want, is not going to change the mutual information. So for example, if I give you encrypted images, they all look like white noise, and as far as mutual information is concerned they're no different from the images themselves — but of course a neural network has a hard time with them, unless it breaks the encryption. So this tells us that mutual information alone is not enough; we need more than mutual information in order to fully characterize neural networks. Now what I'm going to do, just as a teaser for tomorrow, is show you this movie that became very famous, if you follow YouTube. I'm going to discuss this movie in great detail tomorrow and actually explain what's going on. If you look at a neural network in terms of mutual information, we have this chain of inequalities: the information about X can only go down when I move from one layer to the next, because there is this Markov chain — the data processing inequality tells me that this mutual information can only decrease — and the mutual information about Y can only decrease as well.
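Both properties — information shrinking along a Markov chain, and invariance under a reversible map — show up in a toy binary-channel chain. The flip probability 0.1 is an arbitrary illustrative choice:

```python
import numpy as np

def mutual_info(pxy):
    """I(X;Y) in bits from a joint distribution matrix p(x, y)."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    mask = pxy > 0
    return np.sum(pxy[mask] * np.log2((pxy / (px * py))[mask]))

# A toy Markov chain X -> Y -> Z where each arrow is a binary symmetric
# channel flipping its input with probability 0.1 (a stand-in for a layer).
def bsc(eps):
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])
pxy = px[:, None] * bsc(0.1)  # joint p(x, y)
pxz = pxy @ bsc(0.1)          # joint p(x, z): two noisy steps composed

print(mutual_info(pxy), mutual_info(pxz))  # I(X;Z) < I(X;Y): processing loses bits

# A reversible step (here a permutation of Y's values) loses nothing:
perm = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mutual_info(pxy @ perm))  # equal to I(X;Y)
```

The composed channel is noisier (effective flip probability 0.1·0.9 + 0.9·0.1 = 0.18), so the information about X strictly drops, while the permutation — playing the role of an encryption — leaves it untouched.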
Remember where Y was: the further I go along the chain, the further I am from the desired, true label. So I'm just going to give you a promo of my talk tomorrow; my main talk is going to explain in great detail what you see here. This is a numerical experiment in what I call the information plane. What you see here is the information that the layers carry about the input — I call a layer T here, just like the representation T from before. Each layer is in a different color: in blue you see the layer closest to the input, in yellow the next one, and so on, up to the last hidden layer in orange. Now, this is a very specific network, but I'm going to argue that this is a very general picture — I know that a lot of people argue with me about it — general in the sense that I'm going to explain tomorrow. What you see here is the story of the information plane: the abscissa is the mutual information that the layer has about the input, what I call the encoder information, and the ordinate is the mutual information that the same layer has about the output, the decoder information. I'm going to prove a theorem tomorrow, which I call the information plane theorem, but I wanted to show you this first. This is the initial condition of the network: randomized weights, Gaussian distributed around zero with some width — that's the way we usually initialize neural networks, for very good reasons we're going to discuss; a different initialization is going to look different. This is a sigmoidal neural network, but it doesn't matter — the same thing happens in other networks — and it doesn't matter at this point how I estimate the information; I'm going to talk about that tomorrow. I just want to show you how it behaves when you train the network. This is the basic motivation for the rest of my talk: trying to understand what's going on here.
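The speaker defers the estimation procedure to the next lecture. For orientation only, here is one common way such information-plane curves are produced in practice — discretizing (binning) the saturating activations — which may or may not match the speaker's exact method; all names and numbers below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def binned_mi_with_input(t, bins=30):
    """Estimate I(X;T) in bits by discretizing activations t (one row per
    input sample) into bins. With distinct inputs and a deterministic layer,
    H(T|X) = 0, so the estimate reduces to the entropy of the binned T."""
    edges = np.linspace(t.min(), t.max(), bins + 1)
    codes = np.digitize(t, edges)                  # bin index per unit
    _, counts = np.unique(codes, axis=0, return_counts=True)
    q = counts / counts.sum()                      # distribution over bin-vectors
    return -np.sum(q * np.log2(q))

# Toy "layer": 200 random inputs through 3 sigmoidal units (made-up sizes).
x = rng.normal(size=(200, 5))
t = 1.0 / (1.0 + np.exp(-x @ rng.normal(size=(5, 3))))
mi = binned_mi_with_input(t)
print(mi)  # bounded above by H(X) = log2(200), about 7.64 bits
```

The bin count acts as an implicit noise scale: coarser bins merge more inputs into the same discrete code and lower the estimate, which is one reason estimation details matter for these plots.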
So, what you see here is 1,000 epochs of back-propagation — stochastic gradient descent — and you see that during roughly the first 300 epochs all the layers went up, especially the last one, and they did it rather quickly. This is something I'm going to associate with the drift phase later on, in the Fokker-Planck equation. And you see that the data processing inequality is respected: the first hidden layer has a lot of information about the input and a lot of information about the output — essentially one bit, which is all the information — while the later layers have lost information. The last hidden layer at this point, around 300 epochs of training, has about 0.86 bits of information about the output — by the way, when I say bits, I mean the units of entropy; one bit is the full information here, because the label is binary, yes or no — and about four or five bits of information about the input. But from this point on we see this very strange thing, which when we saw it for the first time I really jumped: you see that it takes a long time, up to 10,000 epochs, for the last layer — which is really the important one — to reach essentially one bit. Then I know that I'm okay: I have one bit about the label, all the information there is, and I can generalize. And it got there by this very funny trajectory, which I'm going to discuss in great detail tomorrow. Essentially this is the average of those plots, of those spheres, and you see that from A to C the layers went up, gaining information about the label, while also gaining information about the input — all the layers. And then from C to E they did this very slow thing — you see the trace becomes very dense, which means the time steps are close together — and eventually the last layer converged to a good point in the plane. Now, there are many, many questions you should ask yourself when
you see this movie. One of them is: is this just a very special network, or is this the general picture? I'm going to tell you that it is the latter. A second question — all of these are going to be discussed tomorrow — is: what is the meaning of the mutual information? Is I(T;Y) actually related to generalization? I'm going to prove to you, very easily, that it is — it gives a very sharp bound on the generalization error. But the question is, what does I(X;T) mean? I'm going to prove to you, by modifying the PAC bounds, that I(X;T) actually controls the generalization error very much like the dimension does — actually not I(X;T) itself, it's 2^{I(X;T)} that really plays the role of the dimensionality. So there's something completely different here: instead of covering the hypothesis class, I'm going to cover the input class. That's the exercise for tomorrow. And then the really interesting question: why do these values concentrate? I have very different networks here — everything that could be randomized was randomized: the initial conditions are randomized, the examples are randomized — and you see 100 repetitions, and they look very concentrated in this plane. Now, when you see concentration, you understand there's something about this large-scale limit which seems to push them together. This is what in physics we call good order parameters — like the magnetization of spins, like what we use in statistical physics all the time. Things that concentrate in the limit are very important, and you see that these two information values concentrate very nicely. So we need to understand under what conditions this will happen. And of course, from what I showed you about entropies — of course pixels in an image are not i.i.d.; in most cases they're highly structured, there's a lot of connectivity, a lot of mutual
information, a lot of dependence between different pixels — but still, somehow, the log-probabilities are going to concentrate, and that's something very important. Once we understand the concentration of these measures, we can prove the main theorem: essentially, that the whole story of learning and of the representations in deep networks is described in this plane, in terms of what the encoder and the decoder are. And then we are going to talk about the dynamics, the actual dynamics — why it looks the way it looks. Okay, so I'm finished for today. Thank you very much.