Okay, good afternoon, everybody. I'm very happy to see, first of all, and our speaker doesn't know this, that I think all our postgraduate diploma students are here. This is probably the first colloquium you will be attending, so welcome to ICTP; this should give you an idea of the kind of science that is being done at ICTP. I'm very happy to have with us today one of the leading experts in this field, Professor John Shawe-Taylor. He is the head of the Centre for Computational Statistics and Machine Learning at University College London, and he is also the director of IRCAI, the International Research Centre on Artificial Intelligence in Ljubljana, which is a UNESCO Category 2 institute. He has contributed to a number of fields, ranging from neural networks and graph theory to cryptography, and you will hear more about that. I just want to tell you that he has also been invited by ICTP to the 3S, the next event, and he will participate in a panel discussion tomorrow, Saturday, at 3 o'clock. You are welcome to go, because it has an interesting title: "I lost my job to an AI chatbot." I will now give the floor to my colleague, Professor Antonio Celani, the head of the Quantitative Life Sciences section at ICTP, who will give a more scientific introduction to the work that Professor Shawe-Taylor has been doing. As usual, we will have refreshments on the terrace after the talk. Before that, just to let the new students know, we have a tradition at ICTP that they get an informal half-hour meeting with the speaker just after the talk, before they come to the refreshments, and they can ask him anything they want to know about not losing your job to an AI chatbot. So welcome all of you, and I give the floor to Antonio.

Welcome, everybody. It is of course very difficult to summarize all the contributions that John has made to the field of machine learning, but to mention just a few: his main contribution has been developing the analysis and the design of principled machine learning algorithms founded on statistical learning theory, which is the subject of today's colloquium. He has also contributed a great deal to research management. He has been involved in assembling a series of European networks, the NeuroCOLT projects and the PASCAL networks, and through these networks a whole generation of scientists interested in these fields has been able to develop their interests, both in theory and in industrial connections. He is the coordinator of the Network of Excellence in Pattern Analysis, Statistical Modelling and Computational Learning, hence the PASCAL acronym, funded by the European Union. He has published over 300 papers with more than 40,000 citations, and he is also the author of a very influential textbook; everybody needs to read that book. And, as the director said, tomorrow there will be a session chaired by Marco Zennaro, downtown, to which you are all warmly invited. Without further ado, I give the floor to John.

Well, thank you very much for the kind introductions. It's a great pleasure to be here, and a great honor. It's also a great challenge, because I'm not a physicist, so I'm going to have to try to explain what I do in a way that is accessible. I think it's also very interesting to think about the relationship between different disciplines and to try to understand, in a sense, the different paradigms that we adopt.
I think we can learn a lot by thinking about what is and what is not relevant to a different discipline, because a lot of what we do is very common. Learning from data is obviously at the heart of physics: you have to collect the data, and you wish to learn some synthesis or some representation of that data that explains it, that distills from it some core information. That is what machine learning people would say they are doing, in some respects. However, I think there is a fundamental difference, and the key difference is the way in which we think about success, the way we measure success, in distilling from data that model or that core representation. In physics, one is really hoping to hit on the fundamental laws of the universe. You are hoping to extract formulae that are in a sense universal and that in some way represent a truth. Now, of course, there is always the question of what the final truth is: you might say the laws of gravity were perfect until general relativity came along, and maybe things changed a bit. But there is definitely this feeling that you are aiming for something that captures the truth. Machine learning, by contrast, measures success simply by how the model will perform on new tasks of the same type. It says: you give me all this data, I am going to learn something about it, and now I would like to perform well on more data that might come from the same distribution. That is very different, because it is not saying there is anything right about the model you have learned; it is simply saying that it is functionally effective. That is the measure I think is at the core of machine learning, and the kind of analysis I will describe today attempts to make that notion of performance more precise. That is what statistical learning theory is attempting to do: to develop a theory that tells us how the performance we observe on the data used to train or develop our model translates into performance on data from the same distribution when we actually use the model in practice, when we make predictions on new data from the same source. That is the fundamental core of what statistical learning theory is trying to measure. It is not saying that the representation is right in any absolute sense; it is just saying it is functionally correct: it does the correct thing, or, to be more precise, with high probability it does something close to the correct thing. But I think that is still a very useful and interesting analysis, and I hope it will also be relevant for many applications. Certainly there is now a great deal of interest in applying AI within the context of physics modelling; obviously Elon Musk thinks he may be able to make some progress in physics by applying AI, we'll see. But there is also very great interest in developing models within machine learning that incorporate knowledge of physical processes, and that are therefore able to get by with much less data, because you are building into your learning process core properties that otherwise you would have to learn just from data. The typical approach of machine learning is a blank slate: give me lots of data and I will try to learn a representation. So the key of machine learning is to be able to generalize.
Generalize means performance on new data; it does not necessarily mean having the right answer. There is a simple example in this kind of diagram, where you have blue dots and red dots that you need to separate. There is noise in them, so not all of them are going to be exactly correctly positioned. If you try to make an exact classification of these points, you end up with something like the green line, which looks like it is overfitting, capturing the properties of the noise, whereas the black line is more likely to be the correct model of this data, the one actually capturing the underlying process. Now of course that is all subjective assessment; what we are trying to do with statistical learning theory is to make that notion more precise. From examples, the system should learn about the underlying phenomenon; if it is just memorizing, it is probably not doing a very good job, and that is referred to in machine learning parlance as overfitting. Generalization is the ability to perform well on unseen data from the same distribution. The other thing about statistical learning theory that I think connects it more closely to traditional science, physics as well, but perhaps more the biological sciences, is that we talk about high confidence in our assertions. Let me go into the detail a little and then it will be clearer. For an algorithm, typically what is happening is that you have a function class from which you are trying to select an appropriate function, and you are given some data of a particular sample size, a certain amount of data, and the idea is that the data is generated according to a particular distribution. If we imagine doing this process many times (of course in practice we will only have one training set, but imagine we can generate many training sets), then we will actually get a whole distribution of test errors: each time we run on a training set we get a different function, and that function will have a particular performance, some error rate, on the test data. If we do this many times we get a distribution of test errors. If we just focus on the mean of that error distribution we can be quite misled, because in practice we only have one sample, the data we are given from the actual observations. What we want is confidence that from that one sample we are going to get good performance, and that is what statistical learning theory attempts to do: it attempts to bound the tail of the distribution of those errors, to ensure that the probability of performing badly is small. In other words, we want our bounds on performance to hold with high probability. This is very similar to a statistical test, where we might say that at the 99 percent confidence level we are sure the null hypothesis is false; in this case we are saying there is 99 percent confidence that the conclusion of our assertion about this particular function we have learned is true, so that we are confident the performance guarantee will hold. That confidence, as I say, is over randomly drawn training sets: the chances are that we could have been misled, but the chance that this training set was that bad is less than one percent.
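To make that picture concrete, here is a small illustrative simulation of the idea: draw many training sets from the same source, fit a memorizing rule and a regularized rule on each, and look at the resulting distribution of test errors and its tail. The dataset, the two models and all parameters below are my own stand-ins chosen purely for illustration; this is not the experiment shown in the talk.

```python
# Illustrative sketch (not the experiment from the talk): estimate the
# distribution of test errors over many randomly drawn training sets,
# comparing a "memorizing" 1-nearest-neighbour rule with a regularized
# linear classifier.  All dataset parameters are arbitrary choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=20000, n_features=10, flip_y=0.1,
                           random_state=0)          # noisy two-class data
X_test, y_test = X[10000:], y[10000:]               # held-out "new data"

def test_error_distribution(make_model, n_trials=200, m=100):
    """Test error of the learned function for n_trials independent
    training samples of size m drawn from the same pool."""
    errors = []
    for _ in range(n_trials):
        idx = rng.choice(10000, size=m, replace=False)
        model = make_model().fit(X[idx], y[idx])
        errors.append(np.mean(model.predict(X_test) != y_test))
    return np.array(errors)

for name, make_model in [("1-NN (memorizes)", lambda: KNeighborsClassifier(1)),
                         ("linear (regularized)", lambda: LogisticRegression(C=1.0))]:
    err = test_error_distribution(make_model)
    # the mean can look similar; the tail (e.g. the 99th percentile) is
    # what a high-confidence bound is trying to control
    print(f"{name:22s} mean={err.mean():.3f}  99th pct={np.quantile(err, 0.99):.3f}")
```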
And this is the acronym that perhaps you have heard of: PAC learning, probably approximately correct. The acronym doesn't inspire a lot of confidence, but what it refers to is that we have this confidence parameter delta, and the probability that a large error occurs, an error larger than the bound we have derived, is less than or equal to delta. So delta is the confidence level, the two percent, one percent, five percent, whatever it might be: the probability that the training set has misled us in our assertion. We therefore have a high-confidence bound: the probability that we are approximately correct is greater than or equal to one minus delta. That is the frame we are aiming for, and again notice that it says nothing about the actual function itself being correct; it just says that the function gives the right outputs, that with high probability it delivers correct performance. Here is a diagram of that error distribution, actually generated from some real data from a breast cancer analysis, using exactly the same function class (linear functions on the parameters) with a very simple Parzen window estimator and with a linear SVM, a support vector machine. The two distributions of errors look pretty similar in many respects, until we look at the tail: there we see that the SVM has curtailed the tail, ensuring that we are much less likely to have very bad performance, whereas the Parzen window has a quite significant tail of really bad performance. The algorithms that statistical learning theory promotes or justifies are those that have that bound on the tail of the distribution.

Now let me move to a little of the mathematical formulation. We typically think of a training set, a set of pairs of inputs and outputs, and we select a function from a class of functions, a set of predictors. The learning algorithm is just a mapping from m-tuples of training samples (we assume a sample size of m, so we get m input-output pairs) to a function: we are trying to find a function, via the learning algorithm, from those pairs. This is known as the training set or training sample, and as I said, the classical assumption is that the data-generating distribution is the same both for generating the training set and for the quality measure, the performance we measure on the learned function. The examples are generated independently and identically according to this distribution. That is an assumption which can break down, and some of these assumptions can be relaxed, but I am not going to talk about that in this talk. So what do we want to do? From the examples we would like to learn a predictor; that is what the algorithm does, that is the mapping. But we would also like to certify some performance measure in the sense I mentioned: with high confidence, the performance of the function we have learned is bounded above by some, hopefully reasonably tight, bound on its actual performance. There is a kind of interleaving here with the theoretical work: as soon as you have a bound that depends on some parameters of the function, it can be turned into an algorithm that tries to optimize that bound, that tries to find a function which performs as well as possible according to it.
So there is this interplay: a bound implies an algorithm, and to some extent it also works the other way, with some innovative algorithm driving a new way of thinking about the bounding in the statistical learning approach. We typically use prior knowledge that gives us some inductive bias; there are no-free-lunch theorems saying that if you have no inductive bias then you will not be able to learn. The tension is to leave things as flexible as possible while still being able to learn effectively, and I think that applies to society in general: we all have biases, but hopefully we can be flexible enough to learn from the data we are presented with. Certifying performance has to come from something we have observed on the training set, which we can convert, together with the output of the algorithm, into this bound on performance on the data to come. And these goals interact, as I said, because the bounds inform new algorithms. At the heart of this is a loss function, which measures the discrepancy between the predicted output and the true output for a particular function we might propose. This loss function is then estimated on the training set, giving what is known as the empirical risk: the average of the loss over the training data, the so-called in-sample risk. Our algorithm will typically try to minimize this measure of performance, but the quantity we are actually interested in is the expected risk on a randomly drawn new example from the same distribution: the expectation of the loss on a randomly drawn pair (x, y). That is the out-of-sample, or theoretical, risk. As for the kinds of loss functions we can consider: for classification, which I will mainly be talking about, it is just "did we get it right or not", a zero-one loss; for regression it might be a square loss; there is also something known as the hinge loss, which is a proxy for the zero-one loss; and for density estimation we can use the log loss, minus the log of the probability, where h is understood to be a probability.
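Just to pin those quantities down in code, here is a minimal sketch of the zero-one loss, the empirical (in-sample) risk, and a Monte Carlo stand-in for the expected (out-of-sample) risk. The data-generating distribution and the fixed linear predictor are invented for illustration, not taken from the talk.

```python
# Minimal sketch of the risk quantities: a loss function, the empirical
# (in-sample) risk on a training set, and a Monte-Carlo estimate of the
# expected (out-of-sample) risk.  The data source and predictor are
# stand-ins invented for this example.
import numpy as np

rng = np.random.default_rng(0)

def sample(m):
    """Toy data-generating distribution D: x ~ N(0, I_2), y = sign(x_1 + noise)."""
    X = rng.normal(size=(m, 2))
    y = np.sign(X[:, 0] + 0.3 * rng.normal(size=m))
    return X, y

def zero_one_loss(h, X, y):
    return (np.sign(X @ h) != y).astype(float)   # 1 if misclassified, else 0

def empirical_risk(h, X, y):
    return zero_one_loss(h, X, y).mean()          # average loss in sample

def expected_risk(h, n_mc=100_000):
    Xt, yt = sample(n_mc)                         # fresh draws stand in for "new data"
    return zero_one_loss(h, Xt, yt).mean()

X_train, y_train = sample(50)
h = np.array([1.0, 0.0])                          # a fixed linear predictor
print("empirical risk:", empirical_risk(h, X_train, y_train))
print("expected risk :", expected_risk(h))
```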
The log loss would be the closest to the type of approach taken in generative AI, which is now creating a lot of interest, and possibly losing you your jobs, since ChatGPT might be replacing humans. Now, log loss is actually not what is used for training such systems, because to compute it you need to be able to estimate a probability of the output, and there is a problem working out the normalizing constant, the partition function, in these more complex models. So typically what you actually do is train the system, given the input with a few words removed, to reproduce the full sentence. That is the typical training in these large language models: take a sentence, typically an English sentence, remove a few words, and ask for the full sentence as output. The system has to learn to fill in the gaps, and to do that it has to infer information from the other words in the sentence. Effectively this turns it into something more like a classification problem: the input is the sentence minus the words, the output is the full sentence, and you get it correct if you fill in the blanks correctly and incorrect otherwise. But of course it is, in a sense, a proxy for the log loss: what they are really trying to do is get a system that generates data according to the ambient distribution, in this case the distribution of sentences on the internet, and I think there is good evidence that it does a pretty good job of that. But notice something interesting: what is the test? If you are thinking about log loss, the test is that your outputs, real sentences, should have high probability, and that probably is true for these systems. The problem is that this does not guarantee there won't also be mixed in some pretty bizarre sentences whose probability is not high, and that is what we see with ChatGPT and some of these language models: it generates some pretty interesting outputs that are not so accurate. It told me I was ten years younger, which is pretty good, but it turns out to be inaccurate.

Okay, now a little history, and then I will come to the main approach I want to introduce, the so-called PAC-Bayes approach, which is a kind of mix of Bayesian inference with this probably-approximately-correct approach. The very simplest building block is the single hypothesis bound, which is very easy to obtain, and which has the nice property that the confidence parameter enters under a log(1/delta), delta being the probability that the bound is incorrect. We observe some in-sample performance of a function h, and we bound, with high probability, its out-of-sample performance. It is simply a deviation bound, in which the sample size enters under the square root: the deviation is at most the square root of (1/(2m)) log(1/delta). A very, very simple bound. The simple next step is a finite function class: we take a union bound over the original bound, and the size of the function class enters; we apply the bound to each function in the class with delta replaced by delta over the size of the class, and when we sum all the deltas we are still confident, with probability one minus delta, that the bound holds simultaneously for all the functions. There is then a kind of stepping stone towards the PAC-Bayes approach, which is the idea of associating a prior probability with the different functions, function h_i having prior probability p_i, so that the size of the hypothesis class is replaced by 1/p_i in the bound. With a uniform distribution you recover the previous bound, but this allows you to be more refined: you say "I believe this function is more likely to be the one", and if you are right, you get a tighter bound on that particular function. For uncountably infinite function classes, one traditionally moves to something known as the VC dimension, and to Rademacher complexity, again very interesting theoretical work. But these approaches bound the performance of individual functions, and although they do take account of some correlations, PAC-Bayes can do better, because it considers distributions over hypotheses. That is what I am going to spend a little time on, and then I will show you some of the results.
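Before moving on, here is a small sketch of those two building blocks in code: the single-hypothesis deviation bound and the prior-weighted union bound over a finite class. These are the standard Hoeffding-style forms as I understand them from the description above; the empirical risks and prior weights fed in are invented numbers for illustration.

```python
# Sketch of the two simplest bounds mentioned: the single-hypothesis
# deviation bound and its union-bound extension to a finite class with
# prior weights p_i.  The inputs below are made up for illustration.
import numpy as np

def single_hypothesis_bound(emp_risk, m, delta):
    """With prob. >= 1 - delta: true risk <= emp_risk + sqrt(ln(1/delta) / (2m))."""
    return emp_risk + np.sqrt(np.log(1.0 / delta) / (2.0 * m))

def weighted_union_bound(emp_risks, priors, m, delta):
    """Holds simultaneously for every h_i: true risk of h_i
    <= emp_risk_i + sqrt(ln(1/(p_i * delta)) / (2m))."""
    emp_risks, priors = np.asarray(emp_risks), np.asarray(priors)
    return emp_risks + np.sqrt(np.log(1.0 / (priors * delta)) / (2.0 * m))

m, delta = 1000, 0.01
print(single_hypothesis_bound(0.05, m, delta))
# three functions, the first believed a priori to be the most likely
print(weighted_union_bound([0.05, 0.04, 0.06], [0.8, 0.1, 0.1], m, delta))
```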
This is all trying to lift the lid a little on what is behind machine learning, what the theoretical foundations of this approach to analyzing data are, with the emphasis on what we are actually measuring: what we are guaranteeing is performance on new data. It is not that the function is correct, but that its performance on new data can be bounded in this high-confidence fashion, and you will see, hopefully, that this can surprisingly be extended to quite advanced machine learning systems and still produce tight bounds. The PAC-Bayes framework works by considering a prior distribution (this is true also of the Bayesian approach to machine learning, and I will contrast the two, Bayesian and PAC-Bayesian, in a minute), and you have to fix this distribution before you see the data. It encodes some prior belief about which functions in the class are more likely to arise; I essentially did that a moment ago with the prior weights p_i on a finite function class, but here we consider it more generally, over a continuous function class. Then, based on the data, we learn a posterior distribution Q over the function class; again, these are distributions over functions. When we want to make a prediction, we receive an input, we draw a function according to Q, the posterior distribution, and we predict with that function; and for each prediction we make, we make a fresh draw from Q. The risk measures, the in-sample and out-of-sample risks we had before, now generalize slightly, because we take an expectation over the draw of the random function from the posterior distribution: the in-sample error, the average loss measured on the training set for a particular h, is now averaged according to the posterior distribution, and similarly our predictive performance on new data is averaged according to Q. One of the key quantities that will emerge is a distance measure between the prior distribution and the posterior, and the natural one is the Kullback-Leibler (KL) divergence: the expectation, under the posterior distribution, of the log of the probability of h under the posterior divided by the probability of h under the prior. Clearly the posterior distribution has to be absolutely continuous with respect to the prior.
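As a toy illustration of these ingredients, before getting to the contrast with Bayesian inference, here is a short sketch on a small finite set of classifiers: a prior P and posterior Q over the functions, the randomized (Gibbs) prediction rule that redraws a function for every prediction, and the KL divergence between Q and P. The classifiers, the data and the particular Q are all invented for illustration.

```python
# Sketch of the PAC-Bayes ingredients on a finite function class: prior P,
# posterior Q, the Gibbs (randomized) prediction rule that draws a fresh
# h ~ Q for every prediction, and KL(Q || P).  Everything here is a toy.
import numpy as np

rng = np.random.default_rng(0)

hypotheses = [lambda x: np.sign(x[0]),           # a few candidate classifiers
              lambda x: np.sign(x[1]),
              lambda x: np.sign(x[0] + x[1])]
P = np.array([1/3, 1/3, 1/3])                    # prior, fixed before seeing data
Q = np.array([0.7, 0.1, 0.2])                    # posterior chosen after the data

def gibbs_predict(x):
    h = hypotheses[rng.choice(len(hypotheses), p=Q)]   # fresh draw per prediction
    return h(x)

def gibbs_empirical_risk(X, y):
    # Q-average of each hypothesis's in-sample error
    per_h = [np.mean([h(x) != t for x, t in zip(X, y)]) for h in hypotheses]
    return float(Q @ np.array(per_h))

def kl_divergence(Q, P):
    return float(np.sum(Q * np.log(Q / P)))      # Q must be absolutely continuous w.r.t. P

X = rng.normal(size=(20, 2)); y = np.sign(X[:, 0])
print("one randomized prediction:", gibbs_predict(X[0]))
print("Gibbs empirical risk:", gibbs_empirical_risk(X, y))
print("KL(Q||P):", kl_divergence(Q, P))
```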
So what is the relationship with Bayesian inference, which is another, more traditional way of forming machine learning and which does not typically come with a bound; it is just a methodology, if you like. The way that works is that you need a prior distribution, which is supposed to encode your understanding of how likely the different functions are to arise in this particular context, and then there is an observation of the data. The inference is based on the likelihood that the data is consistent with a particular function, and that likelihood is used to update your probability distribution: if a particular function is unlikely to have generated the data you see, its probability falls, whereas if a function is consistent with the data, its probability increases. You end up with a posterior distribution which is uniquely defined by the prior and your likelihood model; that is your posterior distribution. In the PAC-Bayes framework we are a little more flexible. Firstly, the Bayesian approach requires the prior to be, in some sense, correct: of course in the end you use a tractable prior you can work with, but the idea is that the inference is only correct if that prior is the correct one, the one that really encapsulates the likelihood of each function arising. In the PAC-Bayes framework we are free to choose any distribution we like, provided we choose it independently of the training data, and then we may take any posterior distribution. Obviously we will choose one according to criteria that make it a good choice, but in principle the theory works whatever posterior distribution we choose; it is only the tightness of the bound that is affected by the choice. The way the choice enters the tightness is through the distance we move from the prior, and also through the loss function. So we can think of the prior as a sort of exploration mechanism over the function class, and the posterior as the updated prior after confronting the data. Just to draw the comparison: in PAC-Bayes the bound holds for any prior distribution, whereas with Bayesian inference the prior choice is somewhat more difficult to justify; in PAC-Bayes the bound holds for any posterior distribution, whereas in the Bayesian setting the posterior is uniquely defined by the prior and the likelihood model, and if you are doing approximate Bayesian inference you try to find a good approximation to the correct posterior by something like variational inference. The data distribution also enters slightly differently in the two cases: in PAC-Bayes, as in statistical learning theory generally, we think of an i.i.d. distribution generating the training data and the same distribution generating the test data on which we measure performance; in the Bayesian framework the randomness lies in the noise model, which, given an input, generates the output for that input, and that is the randomness used in the Bayesian inference.

So here is the general PAC-Bayes theorem. I don't want to get too much into the detail, because I realize this is getting a bit buried in the formulae, but the frame of it is interesting. What we are looking at is a distance: Delta is a distance function (in the classification case it will actually be a simple KL divergence between discrete Bernoulli distributions, but think of it just as a distance) between the in-sample, or training, loss (remember, this is the average loss when a function is drawn according to the posterior distribution) and the out-of-sample loss. The in-sample quantity is something we can measure and optimize in our training process; the out-of-sample quantity is the thing we really want to bound, the thing that tells us how well our system will perform on new data. That distance is bounded by a fraction that scales with 1/m (the bigger the sample size, the smaller it gets) multiplying the KL divergence between the prior and posterior distributions, plus a term involving log(1/delta), delta being the confidence parameter: the bound holds with probability greater than one minus delta.
There is also an expression in the theorem that has to be bounded in each particular application; in the case of classification, for example, it is 2 root m, and it enters under the log, so think of it as a relatively small effect. The big effect is the KL divergence. What this tells you, in terms of algorithms, is that when you choose your posterior distribution you want two things: you want it not to be too far from the prior, so that the KL term is small, but equally you want it to perform reasonably well on the training data, so that the empirical term is small. If you can satisfy those two constraints, the bound tells you that the out-of-sample error is also small, and so you have good performance on your test data. Just to mention, without going through it, the idea behind the theorem is a change-of-measure inequality, which is what produces the KL term between prior and posterior, together with Markov's inequality used in a simple way; you put these together in a relatively straightforward sequence of inequalities: Jensen's inequality, the change of measure, Markov's inequality, then an expectation swap, and then a binomial law that allows you to bound that remaining quantity, the supremum of the expression I mentioned, which gets bounded in particular cases. That is how it works: the general theorem is at the top, and below it are particular instances. For classification you get the 2 root m I mentioned, and the distance turns out to be the "small kl", the KL divergence between two Bernoulli distributions, one with parameter q and one with parameter p, a very natural measure of distance between two probabilities, and those are exactly the two probabilities we want to compare. Provided you keep the right-hand side small, you will have a tight bound on the test performance. There is a more general form that covers regression as well, but with a square root, so it is actually weaker; Catoni has a form which is slightly more convoluted but sits somewhere between these two; and there is a further one with again a slightly different form. These all correspond to different choices of the Delta function. Just to mention, without going into the detail: this slide is simply working out that constant, and notice that the theorem is actually true, up until the last step of swapping the two expectations, even if the examples are not drawn i.i.d., which is an important observation if you want to generalize to cases where the data is not i.i.d.
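To make the classification form concrete, here is a numerical sketch of how such a bound is evaluated: the "small kl" between two Bernoulli parameters, its inversion by bisection, and a right-hand side of the form (KL(Q||P) + ln(2*sqrt(m)/delta))/m. This is my reconstruction of the standard statement rather than a formula copied from the slides, and the numbers at the bottom are invented.

```python
# Sketch of evaluating a PAC-Bayes-kl style bound numerically:
# with probability >= 1 - delta,
#   kl(empirical Gibbs risk || true Gibbs risk) <= (KL(Q||P) + ln(2*sqrt(m)/delta)) / m,
# so the true Gibbs risk is at most the "kl inverse" of the right-hand side.
import numpy as np

def kl_bernoulli(q, p):
    q = min(max(q, 1e-12), 1 - 1e-12)
    p = min(max(p, 1e-12), 1 - 1e-12)
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

def kl_inverse(q_hat, budget, tol=1e-9):
    """Largest p in [q_hat, 1] such that kl(q_hat || p) <= budget (bisection)."""
    lo, hi = q_hat, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q_hat, mid) <= budget:
            lo = mid          # mid is still feasible, move up
        else:
            hi = mid
    return lo

def pac_bayes_kl_bound(emp_gibbs_risk, kl_qp, m, delta):
    budget = (kl_qp + np.log(2.0 * np.sqrt(m) / delta)) / m
    return kl_inverse(emp_gibbs_risk, budget)

# invented numbers: 1% stochastic training error, KL(Q||P) = 20 nats, m = 10000
print(pac_bayes_kl_bound(emp_gibbs_risk=0.01, kl_qp=20.0, m=10_000, delta=0.05))
```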
Now I am going to show an application of this which leads to the support vector machine algorithm, and also show some results on how tight these bounds are in different cases; then I will move to a few generalizations of these ideas, and finally a little application to deep networks. In the linear, support vector machine case we are thinking of linear functions in a space that may be defined implicitly through a kernel function, and we place the prior as a simple Gaussian at the origin. We are free to choose it, remember, any way we want, because the bound will be true; the tightness of the bound may be affected by the choice of prior and posterior, but the bound will always be valid. The posterior will again be a Gaussian, but centred at the weight vector corresponding to the support vector machine solution, scaled by a factor mu. So here is your prior, here is the weight vector you learn from the SVM, here is the scaling mu, here is your posterior distribution, and here is the form of the bound. What we are bounding is the expected performance on test data of this stochastic classifier, its true performance, and in fact you can bound the true error of the single, deterministic classifier by observing that it is at most twice the error of this stochastic one: the error of the deterministic classifier is at most twice that of the distribution of classifiers, by a simple observation. The empirical term, the error on the training data, is something we can measure exactly: it basically involves the margin and the cumulative normal distribution (this F-tilde is one minus the cumulative normal) evaluated at mu times the margin, the scaled margin. The KL divergence is particularly simple, because it is between two spherically symmetric Gaussians, so it is just mu squared over two; and delta is the confidence parameter we have already discussed. The final form of the bound is that the probability of misclassifying a randomly drawn test point is at most twice the minimum over mu (remember, we can choose the posterior, hence mu, any way we want) of what I am calling the KL inverse, the largest p consistent with kl(q, p) being less than or equal to the right-hand side, applied to this empirical error and to the right-hand side obtained from the KL term and the other terms in the bound. If you crank that into an algorithm, you end up with the SVM optimization, except that the one-minus-cumulative-normal term is approximated by a hinge loss, which is not a very good approximation, but it makes the optimization convex, so it is much easier to solve. That is the frame. A slight extension is to consider choosing a better prior distribution; I don't want to spend too much time on this because I am a little worried about how much time I have left, but let me say just a little about the prior. There is the possibility of choosing a better prior than simply placing it at the origin, and one way of doing that is to learn a prior from part of the data and then use the rest of the data to evaluate the bound and to train the full SVM. This defines the prior in terms of the data-generating distribution; one can also consider doing that directly, implicitly, in terms of the data-generating distribution, and I will come to that in a minute, but let us look first at the idea of using part of the data to learn a prior. You would expect this to give a much smaller KL divergence between prior and posterior, at the expense, of course, of having less data on which to evaluate the bound, because the part of the training sample used for the evaluation is smaller. So we use that and introduce the learnt prior into the bound: we learn a prior with part of the data, we choose a scaling of that prior, and we then choose a posterior, perhaps trained on the full training data with the SVM.
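Here is a rough sketch of the bookkeeping behind that idea: learn a prior direction from the first r examples, centre the posterior on a direction fitted to all the data, and compare the resulting KL term with the one from an origin-centred prior. The direction fitting below is a crude class-mean stand-in rather than an SVM, and the scalings eta and mu are arbitrary; the point is only where the pieces go, and that the bound is then evaluated on the m - r held-back examples.

```python
# Illustration of "learn the prior from part of the data": fit a prior
# centre from the first r examples, a posterior centre from all the data,
# and compare KL terms.  The direction fits are crude stand-ins, not SVMs.
import numpy as np

rng = np.random.default_rng(1)

def unit(v):
    return v / np.linalg.norm(v)

# toy, roughly linearly separable data
X = rng.normal(size=(1000, 5)); w_true = unit(rng.normal(size=5))
y = np.sign(X @ w_true + 0.2 * rng.normal(size=1000))

r = 300
w_prior = unit((y[:r, None] * X[:r]).mean(axis=0))   # prior direction from first r points
w_post  = unit((y[:, None] * X).mean(axis=0))        # "SVM-like" direction from all the data

eta, mu = 3.0, 3.0                                    # prior / posterior scalings
kl_origin_prior = 0.5 * np.linalg.norm(mu * w_post) ** 2
kl_learnt_prior = 0.5 * np.linalg.norm(mu * w_post - eta * w_prior) ** 2
print("KL from origin prior :", kl_origin_prior)
print("KL from learnt prior :", kl_learnt_prior)      # typically much smaller
# the bound is then evaluated with this smaller KL, but over only the
# m - r examples not used to learn the prior
print("examples left for the bound:", len(X) - r)
```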
Here is our posterior distribution, and now the KL divergence will be the distance from the learnt prior rather than the distance from the origin, and this gives us a new bound. It is the same kind of evaluation: this is the true performance, and this is the evaluation on the training set, but remember that it is now an average over only the data that was not used to train the prior; this is the distance between the learnt prior and the actual posterior centre of the Gaussian; and the divisor is now the size of the data not used for training the prior, that is, m minus r. The new bound is proportional to this much smaller quantity. We can also think about optimizing this; again, the circle of new bound gives new algorithm, so we can devise an algorithm that optimizes w to minimize this bound, and that is known as the Prior-SVM. I won't go into much detail, but again it shows how new algorithms can be created from the way in which we evaluate the bound. There is even a refinement of the bound itself, by taking an elongated posterior rather than a spherically symmetric one: we take our posterior to be shaped more like a rugby ball, and we get a slightly more complicated expression. You don't need the details, but it gives us the freedom to choose the scaling, and again it leads to a different algorithm. I am just trying to give you a sense of how algorithm and bound interact, and how new algorithms emerge that optimize the bounds in new ways.

Here is a comparison of all these different approaches. We compare to tenfold cross-validation, which is the more classical way of setting the hyperparameters that control the algorithm itself, such as the various regularization parameters and the parameters of the kernel used in the support vector machine. Normally this is done by cross-validation, which has no firm theoretical foundation (it is a kind of heuristic, but it seems to work quite well), whereas what we are doing now is using these bounds, which actually guarantee performance, and we compare both the test performance and how tight the bounds are, on quite a few data sets. Here is the kind of result; I will discuss it only briefly. The different data sets are listed down this side; TE is the test error, so these columns are the test errors obtained in the different settings. This column uses twofold cross-validation and this one tenfold (twofold would be a slightly fairer comparison in terms of computation, but tenfold is the more standard choice); this uses a very naive PAC-Bayes bound, this the prior PAC-Bayes bound, and these the eta-prior and the tau-eta-prior, the two more developed algorithms. The first thing to note, looking at the averages at the bottom over all five data sets, is that in terms of the test error obtained with the different model-selection methods, the simplest PAC-Bayes bound actually gives the lowest error, which is kind of interesting. But if we look at the actual bounds on the error, the more refined algorithms and bounds give much tighter bounds on the actual error.
So in some of these cases we get very tight bounds on performance: here we have 0.176 as opposed to, say, 0.203, and here 0.047 compared to 0.175, so the bounds are extraordinarily tight, but they do not actually lead to a better choice in the model selection, which is disappointing; sometimes things work differently from the way we expect. Overall, though, we are very happy that we can derive bounds that are not at all loose: we are looking at a factor of roughly three between the test error and the bound, which you may say is loose, but when we started out in this game there was a factor of a million between the bound and the actual test error. So it is reasonably tight, and the model selection is as good as tenfold cross-validation; this is a valid method of doing model selection, modulo the slightly surprising fact that the looser bounds give better model selection. Here are a couple of papers doing these kinds of things.

The next thing I want to mention is an interesting aspect of the PAC-Bayes approach: if we look at the bounds themselves, the prior distribution enters in only one place, the KL divergence between prior and posterior; everywhere else we evaluate performance with the posterior distribution, and obviously the evaluation of the test performance is also with the posterior. So there is the possibility of bounding that KL term without necessarily knowing the exact form of the prior distribution. This is known as a distribution-defined prior: we define the prior distribution in terms of the underlying data-generating distribution. That seems like it might be cheating, but as long as we do not use the data that was generated according to that distribution, we are okay. Here is an example, due originally to Catoni, where the prior is defined as the Gibbs-Boltzmann distribution with respect to the true risk of the function h, and the posterior is defined in terms of the empirical risk, which we can evaluate. Based on this there is some manipulation that yields a bound on the KL divergence between these two distributions, despite the fact that we do not know what the prior actually is, and it takes a form that looks quite appealing, because a one over m to the three halves enters, which suggests quite a tight bound; but there is a dependency on the gamma parameter, the temperature, or inverse temperature, parameter. So it is intriguing, but in practice it is not as effective as you might expect, because it does not take into account the properties of the actual function you choose; in a sense the function is chosen for you, since this is the distribution you are actually bounding. So despite being an intriguing approach, it does not appear to bear fruit in practice. There are other ways of choosing a prior distribution without actually knowing it, and we have looked at a few of those. Here is one, for example, where the prior is defined as the expectation of y times phi(x), in the support vector machine case, which has a certain logic to it; you can then estimate the distance between an empirical version of this, which you can actually work with, and the true version, which is your prior.
Then a kind of triangle inequality, using the distance to the posterior distribution, completes the argument; but again it is not that impressive in terms of the actual bounds you get out. It is connected to stability. I will skip this, but there is a quite nice analysis of stability that enables a bound on performance, derived by Bousquet and Elisseeff some years ago, and we get a kind of tighter version of that using this PAC-Bayes approach, in terms of the distance between the single weight vector we observe, the one from the training set, and the expectation of the weight vector we would get if we averaged over infinitely many training sets. We are able to bound this distance using the stability analysis, but again the tightness is not that good.

The final thing I want to talk about is performance on deep neural networks. Can we apply this? It is all well and good to derive very interesting bounds for support vector machines, but people are now really interested in how this translates to deep networks, which are very highly parameterized, extremely complex hypothesis classes, and which seem to be at odds with the approach of statistical learning theory of somehow bounding the complexity of the hypothesis class. There is therefore an expectation that the bounds will not generalize to such complex networks, and a feeling that some magic is going on: what is happening? In a sense it is perhaps not as far-fetched as it might sound, because in a way support vector machines are already very complex machines: if you use something like a Gaussian kernel (if you are not familiar with it, it is basically a function class that can approximate any function), you are putting a very great deal of complexity into the functions. The way support vector machines effectively manage that complexity, in the bounds I have shown you, is through the margin, which we saw emerge as an empirical measure of the complexity of the function: how far the KL has to stretch in order to achieve a small empirical loss. If the margin is big, we do not need to create such a large weight vector in order to get good training set performance, so the margin emerges very naturally as an empirically measured complexity of the function we are trying to learn. The same idea applies in a deep network: if there is a wide basin of good performance around the local minimum we have trained to (remember, gradient descent is the typical way this is done), then we can think of this as equivalent to a large posterior distribution that we can fit while remaining consistent with the data, so we again get a much smaller KL term while still delivering good empirical performance. This idea has also been exploited by Dziugaite and Roy, who derived some of the tightest bounds, and the idea is simply to push into your training regime an additional term that aims for a broader basin of attraction, as measured by this posterior distribution. We are able to apply this, and in some sense it appears that the stability of stochastic gradient descent is also important in attaining good generalization, but the PAC-Bayes bounds we have are not able to capture that; so we need to put into the training regime the broadness, or the stability, of the local minimum, which comes directly from the PAC-Bayes bound we derive.
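As a rough illustration of what "putting the bound into the training regime" can look like, here is a schematic objective in the spirit described above: a factorized Gaussian posterior over the network weights, a Gaussian prior, and a surrogate training loss plus a square-root PAC-Bayes complexity term, so that the optimizer is pushed towards posteriors that stay close to the prior, that is, towards broad, flat minima. This is a plain-NumPy sketch with invented numbers, not the implementation from the papers discussed next.

```python
# Schematic "classic"-style PAC-Bayes training objective for a network
# with a factorized Gaussian posterior N(mu_q, diag(sigma_q^2)) and a
# Gaussian prior N(mu_p, diag(sigma_p^2)); the numbers are invented.
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL between factorized Gaussians N(mu_q, sigma_q^2) and N(mu_p, sigma_p^2)."""
    return np.sum(np.log(sigma_p / sigma_q)
                  + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2) - 0.5)

def pac_bayes_objective(emp_loss, mu_q, sigma_q, mu_p, sigma_p, m, delta=0.025):
    """Surrogate training loss + sqrt((KL + ln(2*sqrt(m)/delta)) / (2m))."""
    kl = gaussian_kl(mu_q, sigma_q, mu_p, sigma_p)
    complexity = np.sqrt((kl + np.log(2.0 * np.sqrt(m) / delta)) / (2.0 * m))
    return emp_loss + complexity

# toy numbers: 10,000 weights, posterior slightly moved away from the prior
d, m = 10_000, 60_000
mu_p, sigma_p = np.zeros(d), 0.1 * np.ones(d)
mu_q, sigma_q = 0.01 * np.ones(d), 0.05 * np.ones(d)
print(pac_bayes_objective(emp_loss=0.03, mu_q=mu_q, sigma_q=sigma_q,
                          mu_p=mu_p, sigma_p=sigma_p, m=m))
```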
So if we do that: we have again used this idea of training a prior from part of the data, as we did with the support vector machine, and the second part of the data is then used to optimize this PAC-Bayes bound, which effectively looks for a flatter local minimum. There are different flavours: one uses the kl term for the Delta in the original bound, which we call f_classic; there is a slightly different one, f_quad, which again applies; there is f_lambda, which was the lambda bound I gave earlier; and there is f_BBB, which is basically just doing variational inference, so back to standard Bayesian inference. There does seem to be a good correlation between the bounds and performance: this is a plot of the risk certificate versus the performance, model selection results for more than 600 runs with different hyperparameters, and even with convolutional neural nets with Gaussian data-dependent priors (this was for MNIST, so a relatively simple data set) we still seem to be getting a reasonably good indication that we are capturing something about the data. So this is for MNIST, with fully connected or convolutional neural nets, random or learnt initialization, and the different optimization criteria, and what we see is that the risk certificates, in the case of f_quad, and in fact all three apart from the approximate Bayesian inference, are reasonably tight compared to the stochastic predictor. This is the bound on that quantity, this is the zero-one loss, and this is the cross-validation loss. Looking at the zero-one loss, we are getting something like 0.0279 as a bound against 0.0202 as the performance, so we are getting remarkably tight bounds using this type of approach, and even with the convolutional neural networks we get perhaps slightly weaker but still reasonably good bounds, provided we use the learnt prior; with random initialization the bounds are far weaker. This is for MNIST, quite a small data set, but we have also done it for CIFAR-10, a medium-sized data set (not one of the big ones), and with really quite complex networks, a 15-layer convolutional neural network, we are again able to get risk certificates of the same order as the stochastic predictor's performance. So my take-home message is that, yes, we are not able to capture fully what is happening with stochastic gradient descent, but we are certainly able to put into stochastic gradient descent factors that ensure the bounds that come out are commensurate with the actual performance; we are definitely able to capture, even in these very complex networks, some fundamental properties that translate into good performance. It is generally a flexible framework; I have put in here, in case anyone is interested, some links to references for different applications as well: transductive learning, non-i.i.d. data, which I mentioned, heavy-tailed data, density estimation, reinforcement learning, sequential learning and so on, and deep neural networks. In conclusion, I think one of the key questions in learning is generalization, and as I have emphasized, generalization is about performance rather than about assertions on the accuracy of the model in its structure.
It is about how well the model performs on new data. Modern machine learning appears to contradict many of the conclusions of statistical learning theory, but by modelling in a more refined way, through these distributions over functions, we appear to have overcome this contradiction, at least partially, and thrown light on the different ingredients of achieving good test performance; and this can drive algorithms that give improved bounds and state-of-the-art performance. But I certainly would not claim that this is an understanding of all of deep learning: if we think about transformers or large language models, of course we can approach some of the analysis with this framework, but I think we are far from having that analysis, and theory is quite a long way behind where it would need to be to understand it. These are just some references. That's it, thank you.

Thank you. So we are ready to take questions from the audience; we will start on that side of the hall.

Thank you for a very nice colloquium. My question is about how the performance depends on the structure of the data. These PAC bounds look quite general; they do not depend on what the data is, whether it is pure noise or whether there is some structure, and it is hard to understand why. It should be that if there is a strong signal in the data, then it should be easier to generalize. I don't know whether you can comment on the role of the data.

No, definitely. First, to say: these bounds are provably correct, so if the data is random the bounds will be poor, no question; they will not be tight. The thing that is surprising with deep learning is that even with random data you can still often do very well on the training set. So what is going on? What you end up with in that case is a very brittle local minimum, whereas the bounds, to be tight, require a very flat local minimum, and that is why we put that into the training; it effectively enters the training through the PAC-Bayes bound. If you do that with random data, of course, you will not converge to a flat local minimum, because there is not one, so you will not be able to get a tight bound. In a sense this is a very nice property, because sometimes there is this feeling: if I can get my performance to be good on any data, how do I know whether I have learned something or not? That is precisely what the theory is attempting to say: yes, I can now guarantee that in this case, because you found this nice flat local minimum, you are able to generalize. So I think you have hit the nail on the head in terms of the key benefit of making this analysis.

Thank you. I have two questions. First, assume that we do not know anything about the prior distribution: is the distribution-defined prior sufficiently close to the unknown prior?

I think that is a really good question, because there is this sense in which people talk about neural networks as somehow magic: they can learn any function, and then they learn it and it generalizes. That is absolutely not true. They can only work if the structure of the neural network is implicitly putting a prior distribution over the functions that you might learn.
The structure of the neural network has to be something that helps it to learn the function. Again, I do not think we understand this sufficiently well; it would be great if people spent a bit more time thinking about it. There have been some attempts, people are trying, but it is not very popular, because the popular thing is to say that this is magic. But that, I think, is the key question: why does the structure of a neural network help? Of course there are obvious things, like convolutional neural networks building in some symmetries, a very nice and natural thing to do, but there is a lot more than that going on which must be helping it to learn that kind of function. So what I am saying is that the prior is actually in the structure of the network, and in our case also in the distribution we put on it; but the distributions we put on are just simple independent Gaussians on each weight in the network, nothing fancy, so it is really the structure of the network.

Thank you. The second question: suppose the i.i.d. assumption on the data points fails; what happens to the model?

There has been quite a lot of work looking at this, with different approaches. The obvious way it might fail is with sequential data, where there is some dependence, say a Markov-type sequence in which the next example is partially conditioned on the previous ones. Typically what happens there is that you just need to look at the mixing time: roughly speaking, you get an i.i.d. sample every mixing time, so it dilutes your data hugely. Then there are other, more complex interactions, where you might have a graph structure with examples at the nodes and some dependency between the nodes, and again there are analyses that look at coverings of those graphs and develop what are effectively bounds on the effective number of independent samples. Essentially what you are doing is reducing the problem back to an effective number of independent samples.

Thank you very much for a very nice talk. As your slide with the references pointed out, it would take a few years to formulate a very good question, but there is essentially a large field of research that probes the tails in deep neural networks: adversarial attacks. And actually, after adversarial attacks were discovered in neural networks, people went back and found adversarial attacks on logistic regression, SVMs and so on. Where do they fall in your theory? Are they pathological?

It is a great question. The theory really is about probabilities, about the likelihood of randomly drawn data, and adversarial examples are by definition chosen specifically to find a weakness, so this will not cover that; it will not deal with the adversarial problem. My take on that, and I am going to waffle a little now, but I like the question, is that there are two views. One take is to assume that we are different: faced with an adversarial example we would not be so susceptible; you show the picture with the adversarial perturbation and to us it is still a panda, or whatever it was. But there is a problem with that assertion.
The example is not an adversarial attack for our brain; it is an adversarial attack for that neural network. Maybe there is an adversarial attack for our brain that would fool us with a very small deformation of the image; we do not know, because we do not know how the brain actually processes the data. That is one argument. But I think I prefer the other argument, which is that we have a backup mechanism: if something does not immediately trigger "oh, that's a lion", we start to look at the components of the image and say, well, that looks like a lion's tail, or that looks like panda colouring, or fur. We have this ability to decompose and reassemble the information, which a neural network does not have. Well, maybe transformers do to some extent; that is close to what a transformer does, because it says "put my attention on this or that feature of the image", so maybe transformers are trying to fill that gap. But I think that is how we are able, in some sense, to mitigate that potentially dangerous adversarial attack. What that tells us about what we should do for neural networks, I do not know; but this theory is not really able to handle it.

Thank you for your presentation. I have a question about the relationship between statistical learning theory and machine learning, because I do not really know whether statistical learning theory is focused on explaining machine learning algorithms that might otherwise be taken as black boxes, or whether statistical learning theory is itself another method that can be used to learn from data, using techniques like confidence intervals or significance tests to assess accuracy. I am really confused about the relationship between statistical learning theory and machine learning.

Okay. Well, I should certainly emphasize that it is not the only theory of machine learning: Bayesian analysis, you might say, is a theory of machine learning, in the sense that it motivates algorithms based on a theoretical model, and so is more traditional statistics. What I wanted to say is that it tries to distill what I believe is at the heart of most machine learning algorithms, which is the desire to get performance on a test set by adjusting parameters based on a training set; in other words, to learn something about an underlying phenomenon from a finite set of data. Then the key questions are: how much data do you need, and how do you set up the learning process so as to ensure that you do learn something about that phenomenon? And I emphasized the difference with physics: physicists would say, yes, but I want to know the truth, I want to know the actual structure of my answer, whereas machine learning, typically both algorithmically and in the theory, says I am just interested in the performance I get. So my view, and there are many brands of machine learning, so I would not claim this applies to everything, is that machine learning at heart is an attempt to take data, extract from it some useful function or model, and then use that model to analyze new data, and hopefully get some value out of that new data because of the way I am able to map it appropriately, based on what I learned from the original data source.
Yes, please. Do you think that statistical learning theory could become irrelevant one day with the emergence of deep learning or other sophisticated machine learning techniques?

So again, you mean do I think it could apply to the whole of machine learning, or that it could become irrelevant someday because of deep learning or some sophisticated machine learning technique? No, I don't believe it should become irrelevant. Think of engineers: an engineer builds a bridge, and you'd like to be guaranteed that the bridge is not going to fall down. So you have engineering theory, and you say, right, this bridge is solid, it's going to hold, until some flood washes it away, but hopefully it holds. In a sense we'd like to have the same kind of engineering approach to building machine learning systems. As I pointed out, you can actually get a deep network to perform well on training data that is randomly labelled. Of course it's not going to perform at all on the test set, because the labels are just random; by definition it cannot learn that function, and yet it performs well on the training data. So how are you actually going to know that that is the situation? For that you need something like statistical learning theory, and I don't say it is the only way, but you need some way of measuring, of guaranteeing, of designing your algorithms so that you're guaranteeing the bridge isn't going to fall down, in other words that the neural network is going to perform in the situation in which you're applying it.

A lot of the effort, and this is a UNESCO centre, and our UNESCO centre has also been involved with UNESCO around the ethical aspects, is about trying to understand whether we can say something more solid about the performance of neural networks, and in a sense make that more transparent to users. So a user will actually understand: this neural network is telling me something, but it might not be true; it only has a certain likelihood of being true, and I shouldn't put too much store in it, or I should at least be prepared to question it and try to find out the sources of the information that were used in that training scenario. We're seeing that particularly in the way learning systems are being used to manipulate people, typically to get them to click on ads; one of the key applications of machine learning is to optimize content for a particular individual, because that's where they are most likely to respond, and it's clear that people are not aware of what's going on. So it seems to me quite an important aspect, trying to get a more solid analysis of what's going on, in the same way as engineers would like to understand the structures that they're building.

Thank you.
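A minimal sketch of the random-label observation mentioned in that answer, using scikit-learn's MLPClassifier as a stand-in for a deep network; the data and model are illustrative assumptions, not the original experiments.

```python
# A model with enough capacity can fit training data whose labels are pure
# noise, so good training performance by itself guarantees nothing about
# test performance.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(64, 32))
y_train = rng.integers(0, 2, size=64)        # labels are random: nothing to learn
X_test  = rng.normal(size=(1000, 32))
y_test  = rng.integers(0, 2, size=1000)

net = MLPClassifier(hidden_layer_sizes=(256,), max_iter=5000, random_state=0)
net.fit(X_train, y_train)

print("train accuracy:", net.score(X_train, y_train))   # typically close to 1.0 (memorised)
print("test accuracy: ", net.score(X_test, y_test))     # around 0.5, i.e. chance level
```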
Thank you for a very nice talk. Everything you said was for the case where the training and test distributions are the same, but of course domain shift is a major problem. Presumably your bounds, if any can be derived in the case of domain shift, are much weaker. Can you still guarantee that the bridge will not fall down in that case?

Very good question. Yes, absolutely, the bounds are much weaker, but they are also indeterminate. The thing I like about these bounds is that everything can be evaluated: you can measure the actual empirical error on your training data, you can measure your KL divergence, and you have a bound. There are other bounds, known as oracle bounds, that depend on some assumption about the data: if the function comes from a class that is smooth in some sense, and the distribution satisfies some property, then this algorithm will perform well; but there is no way of checking whether the function or the distribution actually satisfies those assumptions. Similarly with domain shift: how do I measure the domain shift, what is happening? The way that would typically be done is by monitoring performance, and as you see performance deteriorating you'd say, okay, this is now inconsistent with my bound, so there must have been some domain shift, and then perhaps you can make some adjustments at that stage. Sorry, the second part of your question? You said domain shift and guarantees in that case. Yes, so as I say, there wouldn't be guarantees unless you made assumptions that you can't verify, so they're not guarantees in a real sense at that point. There are bounds of the type that say, if the KL divergence between the distribution and the shifted distribution is less than some amount, then you could have a performance guarantee, but I wouldn't put much store by it.
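The remark that everything in these bounds can be evaluated can be made concrete with a small sketch. It assumes one standard form of a PAC-Bayes bound, kl(empirical error || true error) <= (KL(posterior || prior) + ln(2*sqrt(m)/delta)) / m, and inverts the binary KL numerically; the numbers are invented for illustration. When part of the data is used to build the prior, as in the exchange that follows, m is just the number of remaining examples, which is the trade-off discussed below.

```python
# Evaluate a PAC-Bayes-kl style bound from measurable quantities: the
# empirical error of the randomised classifier, KL(posterior || prior),
# the number of examples m used to evaluate the bound, and confidence delta.
import math

def binary_kl(q, p):
    """kl(q||p) for Bernoulli distributions with parameters q and p."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q, c):
    """Largest p >= q with kl(q||p) <= c, found by bisection."""
    lo, hi = q, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if binary_kl(q, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

m       = 5000      # examples held out from the prior, used to evaluate the bound
err_hat = 0.08      # measured empirical error of the stochastic classifier
kl_qp   = 25.0      # measured KL(posterior || prior)
delta   = 0.05      # confidence parameter

rhs   = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
bound = kl_inverse(err_hat, rhs)
print(f"with probability >= {1 - delta}: true error <= {bound:.3f}")
```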
Maybe I have a question too. I was particularly intrigued by the refined PAC-Bayes approach, and especially, if I understand correctly, the part where you say that you use part of the data to construct the prior and the rest to come to the posterior. My question is: how does the fraction of data that you use for the prior scale with the total size of the data? Is it proportional to the size of the data? Does your approach predict any kind of scaling, and if it does, does it depend on the data being unstructured? I'm asking because in a learning process where the data are very structured, you could probably guess a good prior from a few data points and then throw the rest of the data into your posterior. So I'm trying to understand what kind of things one can learn about this, at a qualitative or semi-quantitative level.

Yes, it's interesting. First, the theory doesn't predict any scaling; I think it is very much a data-dependent thing. Very naively, you could say that if it were a linear function class, then once the number of examples is similar to the dimension of the space you start to be able to learn, and with a bit more than that you can probably generalize, so you'd already have a good prior; but as soon as you have less than that, you'll probably have a hopeless prior, really terrible. So there might be quite significant shifts at certain points depending on the size of your data. The other argument is that if you have humongous amounts of data, you only need to hold out a little to get a very accurate performance measure, because you just use that as your test set: you just train. So in a sense the interest in the theory is when you are somewhat constrained with data, and then the question is where the trade-off lies. The trade-off is obviously that the prior gets better, but your weakness is in the denominator of the fraction, because you have less and less data on the part that you can use to evaluate the bound. To be clear, you can use all of the data to do the training; it's just that in the evaluation of the bound you use the remaining data. Our experience was that at least 50 percent, and typically even higher than that, say 70 percent, was a good number in the kind of training regimes we were in.

Okay, if there are no further questions, then we can thank John again. I will just ask the diploma students to hang around for this; we need to take the picture, the photo, sorry. Everybody come here.