What I want to talk to you about today is an exciting field called conformal prediction, and it's exploding with research and new ideas at the moment. I want to tell you a little bit about what's old and a little bit about what's new, so let's get started; we're going to introduce collaborators a bit later on. Let me just jump into the subject.

This is a map of the United States showing the results of the 2020 presidential election, county by county. Each county is indicated by a color: a dark blue color is a county where Joe Biden improved enormously over Hillary Clinton from 2016 to 2020, by about 50%. On the other hand, a dark red color indicates a loss for the Democratic candidate, a loss of about 50%. You see from this map, which has about 3,100 counties of the United States, that Joe Biden did pretty well; of course, we all know that he won the general election.

Now, if you were to work for a news organization such as the Washington Post, you would like to be able to tell your readership how you think the election is going to look once all votes have been tallied. This is a task that the Washington Post set out to do, and as you'll see, they're using a tool that you'll see in these lectures. This is an application where the prediction is very important and the cost of being wrong is extremely high; calling Pennsylvania for the wrong candidate has enormous consequences.

So what they set out to do is this. At any given time on election night, you see a number of reported counties — and I'm not sure you're familiar with elections in the United States, but say at a given time 1,200 counties have reported and you have 1,900 outstanding counties — and you would like to predict the vote in the counties you have not yet seen in such a way that you're correct 90% of the time. So for each unreported county, they set out to issue a predicted range which is correct 90% of the time; that is, the ranges need to include the true vote in about 90% of the 1,900 outstanding counties. This is work that was done by John Cherian, a graduate student of mine at Stanford, and Lenny Bronner, who is a data scientist at the Washington Post.

So if you were to look at the Washington Post portal soon after the election — and here is a shot
I took on November 5th, 2020, at 12:50 a.m.; the election was on November 3rd — you would see that even though Donald Trump, as far as the reported votes were concerned, was leading in Pennsylvania, the Washington Post actually predicted that Joe Biden would end up winning Pennsylvania. And not only this: as I mentioned before, they wanted to communicate faithfully to their readership how they think the election is going to end up, and rather than just giving you a point forecast, they give you a distributional forecast, which is indicated by this coloring, where essentially the dark color is the median of the forecast, and as you go to the right you get into the tails of the forecast. They wanted this forecast to be well calibrated in the sense that I described before. So here we see that the median forecast for Biden is a bit higher than the median forecast for Trump — that turned out to be a correct prediction. What was very interesting to see during election week is that we would see plots like this, and then, as more and more votes came in, the gap would widen or the distribution range would narrow. That's how they communicated what they knew about the election.

Now, I don't think I need to tell you that we use machine learning nowadays in extremely sensitive applications. I think I had this discussion last night — I don't know if you know about this, but in the United States we use machine learning to decide whether inmates will get parole or not. These are extraordinarily consequential decisions. We use machine learning to put you on a treatment plan; we use machine learning for self-driving cars — things where the cost of being wrong may be extremely high. So this raises a question: since we use these systems in critical applications, can we be sure about their predictions? Can we have confidence in these predictions? That's what we're going to talk about.

At the risk of being a bit pedantic, this is what I consider to be data ethics 101. As we use these systems that are extraordinarily complicated, it's very important to understand how certain I am of my prediction when I communicate to a judge whether a person is going to commit another crime if released from jail. I would like some sense of certainty about this prediction; I would like to know how my level of uncertainty should inform my decisions. If we think about finance, for example: if I say, well, here's a big portfolio and I expect its return to be five percent, there's a big difference between five percent plus or minus one percent and five percent plus or minus ten percent. Can you deploy an AI model safely? I think we need to be explicit about all of this to help users.

So, our solution to quantifying uncertainty is by means of what we're going to call prediction intervals, or prediction sets. And I should say up front that these are not confidence intervals, for a very obvious reason that is going to become clear. Basically, the goal — and the thing that we're going to talk about in this hour — is as follows. You have training data; the X_i's might be the covariates of a county, say demographics about the county: the level of education, income, whether it's a rural county or not, the size of the county, and so on. And then you have an outcome, like how the county votes.
And we're going to assume, at the beginning, that this data is exchangeable. If you don't know what exchangeability means, that's all right; you can just assume that the (X_i, Y_i) are i.i.d. from some unknown distribution P — the classical machine learning assumption. The goal that we're going to set for ourselves is, based on the training data, to construct a prediction interval — that is, a range of possible outcomes that we're going to call C(X). This set needs to contain the true label a certain fraction of the time, and for this lecture we're going to just say 90 percent; it's whatever you decide it to be, but it's going to be a certain fraction of the time. So I need to make sure that from the training data I can construct a prediction set that contains the true label 90% of the time. And the surprising thing is that I want this to hold no matter the distribution generating the data, no matter the sample size, no matter the dimension — no matter nothing. The first time you hear this, you'll say it can't be done, and of course the point of this lecture is that we can do it.

What I want to be able to say, going back to my election — and I don't know if it's still working, I hope it is — is: the vote change from 2016 to 2020 for that county is predicted to be between 3.1 percent and 7 percent, and I need that whenever I make nineteen hundred predictions of this kind, 90% of them are okay.

I also want to do something else, which is to use these fancy machine learning prediction algorithms. Well, it's difficult to quantify their uncertainty, because I can barely understand them — but still, we want to use them, and that's what we're going to do as well. To do something like this — and I guess this is the main message of this lecture — I'm not going to open the black box. It's too complicated; it's way beyond my analytical powers. What we're going to do instead is build a protective layer around the black box: we're going to look at the predictions of the machine learning algorithm and build a layer of protection on top of it, in the form of the kind of prediction intervals I talked about, with the guarantees that I want.
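In symbols — my notation here, not a slide from the talk — the goal is the marginal coverage guarantee:

```latex
% Goal: marginal coverage of the prediction set, built from the training data.
\mathbb{P}\bigl\{\, Y_{n+1} \in \widehat{C}(X_{n+1}) \,\bigr\} \;\ge\; 1 - \alpha
\qquad \text{(here } 1-\alpha = 90\%\text{)}
```

and the remarkable requirement is that this must hold for every distribution P, every sample size n, and every dimension.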
I'm going to explain to you how we do this. The first thing you might think about doing, if you want to quantify prediction accuracy, is this. I have some data points, indicated in gray on this slide, and I fit a model, indicated by the red curve — and of course you have to imagine that I used a deep net to do something like this in a modern fashion. A natural thing to do might be to say: well, I'm going to look at the errors, the residuals, and build a predicted range around the model prediction whose width is set by the size of the errors. Except that I don't need to convince this room that this is completely silly. This is completely silly because we all know that training errors are much lower than prediction errors. In fact, when you train machine learning algorithms these days, we train them to zero training error, and so you would have a band of zero width, and clearly it's not going to contain future labels. The problem is very extreme for modern methods.

This is where conformal prediction comes into play, and I should say that conformal prediction has been largely developed by a hero of mine named Vladimir Vovk, who started to work on this about 20 years ago. What he showed with his colleagues, including Glenn Shafer, whom you see pictured here, is that you can actually build valid prediction intervals with no assumptions whatsoever, except that you gave me exchangeable training samples — for example, i.i.d. training samples. So how does this work?

Conformal prediction comes in many different flavors. You're going to see a fancy flavor first, which is called full conformal, and it works roughly like this; then we're going to see a simpler flavor in a minute. Here is one way you can deploy conformal prediction. I have a data set, and it looks like what we see on this chart. My boss comes around, and I see a data point with x_{n+1} = 4.7, and I need to predict the y value that I'm not seeing, in such a way that I'm correct 90% of the time.

The way conformal prediction is going to work is this: I'm going to hypothesize a value of y — a candidate value, little y — and I'm going to fit a prediction model, perhaps with deep learning or random forests or XGBoost, however complicated, to the data including x_{n+1} and the hypothesized value little y. I'm going to get residuals. Then what I'm going to do is see, for the residual of the imputed value y, how unusual it is. To show this I have a little animation, and it works like this. Here's the fitted model. I compute residuals, and I look at the magnitude of the residual at the candidate value little y — how unusual is it? I'm going to compute some sort of a p-value, and here I see that, well, this residual is in the top 27% of all residual magnitudes. Now I'm going to change the value of little y, refit the model, and play the same game again. Now I see that this is a bit larger, because y moved up, and now it's in the top 22%. We keep on moving through the animation — and note that I tried to be careful here: when the red dot moves up, it pulls the model towards it, right? If you look at the black curve, it's being pulled upwards by the red dot. So we go, we go, we go, and the residual gets larger and larger, obviously. And there's going to be a point where, no matter what, this residual is just too unusual.
So I've trained on all the data, and I'm looking at how unusual the residual is — how well it conforms to the other residuals. For each hypothesized value of y, I have a p-value, if you will, which tells me in what top percentage of residual magnitudes it falls. And what I do now is build a prediction interval in such a way that I include y if its residual is in the bottom 90%, and not if it's in the top 10%. If the residual is sufficiently small — in the bottom 90 percent — I keep the y value in my prediction interval; otherwise I don't include it.

Now you'd say: this is perhaps a good method — we'll see that it actually delivers the promises we wanted — but it's expensive, because each time you propose a test value you have to refit the model, and if I fit a deep learning model, that's very expensive. I agree with that, and that's why there are many other proposals in the literature that bypass the refitting, including my favorite one, the jackknife+, which we developed with my collaborators, or CV+, which avoid refitting things all the time. All right, but let's stay with full conformal prediction for now, and I'm going to show you that it actually works.

So this is my most mathy slide. We have exchangeable training data. Does everybody know what exchangeability means? You can assume that the data are i.i.d.; i.i.d. random variables are exchangeable. Exchangeable random variables are as follows: if I give you the realized values of the sequence as a bag, and I ask you which is the first observation, and the second one, and the third one, you would be guessing at random. That's what exchangeability means. Another way to think about this is that the joint distribution function is symmetric in all of its arguments. The best way to think about exchangeability is to say that if you see the entire sequence, but unordered, you have no way to tell which is the first observation, which is the second one, which is the third one, and so on. So it's a relaxation of the i.i.d. assumption.

So let's say I have exchangeable data: here it is, (x_1, y_1), ..., (x_n, y_n), and then the test point (x_{n+1}, y_{n+1}). I'm going to fit the model to all the data points, including the test point. You say: what, you can't do that, you're training on the test point! Well, bear with me: we're going to train the model with the test point and all the training points, and now I compute my residuals. If my training algorithm is symmetric in the data points — it does not depend on the ordering I gave you — then it's a very simple exercise to see that the residuals are exchangeable. In particular, if I ask you what is the chance that the test residual is in the bottom 90 percent of all magnitudes: well, because it's equally likely to be any one of the residuals that I've seen, by definition it's 90 percent likely to be in the bottom 90 percent. Right? This is by definition of the algorithm: the fact that you use a symmetric algorithm, and the fact that you have residuals that are exchangeable — in particular, the test residual is equally likely to be any one of them, and it has a 90 percent chance of being in the bottom 90 percent. All right, so that's easy enough.
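Since the argument is constructive, here is a minimal sketch of full conformal prediction in Python. The linear model, the grid of candidate labels, and the toy data are placeholder choices of mine — the talk lets you plug in any symmetric fitting algorithm:

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # stand-in for any symmetric fitting algorithm

def full_conformal_interval(X, y, x_new, y_grid, alpha=0.1):
    """Full conformal: keep each candidate label whose residual, after refitting
    on the augmented data, is not among the largest alpha fraction."""
    kept = []
    for y_cand in y_grid:
        # Augment the training data with the test point and the hypothesized label.
        X_aug = np.vstack([X, x_new.reshape(1, -1)])
        y_aug = np.append(y, y_cand)
        model = LinearRegression().fit(X_aug, y_aug)  # refit for every candidate -- the expensive part
        residuals = np.abs(y_aug - model.predict(X_aug))
        # p-value: fraction of residuals at least as large as the test residual.
        p_value = np.mean(residuals >= residuals[-1])
        if p_value > alpha:  # test residual is in the bottom 90% -> keep the candidate
            kept.append(y_cand)
    return (min(kept), max(kept)) if kept else None

# Toy usage with synthetic data:
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(100, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(100)
print(full_conformal_interval(X, y, np.array([4.7]), np.linspace(-3, 3, 200)))
```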
And now, what did we see in my picture? We see something that looks like this. I propose to you a test value y in R; I fit the model mu-hat to all n+1 data points, and I compute the residuals — including the test residual with the proposed value little y, which is here. And I check whether, when I plug in this little y, I'm in the bottom 90 percent or not. If I'm in the bottom 90 percent, I keep y in my prediction set; if I'm not, I reject it from my prediction set. So if it's in the bottom 90 percent — if it conforms to the other residuals — we keep it. And of course the result is this: the chance that this procedure includes the true label is exactly 90 percent. It is not 92 percent, it is not 88 percent — it is 90 percent. Why is that? Because if I plug in y_{n+1} for little y, then by construction the inclusion event is exactly the event that the test residual is in the bottom 90 percent, which has probability 90 percent. So you've achieved exact predictive coverage. Is this clear? So it works.

All right, so this is the full conformal procedure as proposed by Vovk in 2005. As I mentioned before, it's not computationally very attractive, because you have to do a model refit each time, and people prefer not to do that. One way you can avoid model refits entirely is this: you give me a data set, and I'm going to split it into a training set and a calibration set. What I'm going to do is fit my model on the first split, the training set, and then calculate the magnitudes of the residuals on the calibration set with the fitted model. If you follow what I'm saying, it's exactly what we've seen before, except simpler: it's like saying that the model mu-hat has been fitted once and for all. So I can use one split to learn a model mu-hat, evaluate residuals on the calibration set — the holdout set — and then employ exactly the same procedure: I keep all y such that their residuals, if I impute y, are in the bottom 90 percent. If anything, the argument for why this works is even easier; it's just saying that mu-hat has been fitted once and for all — it's been pre-trained. And that also works: it yields exact prediction sets with 90 percent coverage.

What's nice about this procedure is that, of course, you fit a model only once. What you lose is some data, because now a data point is used either for model fitting or for calibration, not both; before, we were using data points for both model fitting and calibration. So there's a loss of statistical efficiency here, and that's why, with my colleagues, we have the jackknife+, which I'm not going to discuss in this lecture, and which reuses data points in a smart way for both calibration and model fitting, without the expensive cost of refitting a model each time.
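Here is the split conformal version in code — again a minimal sketch; the random forest and the synthetic data are my placeholders, and the quantile uses the usual finite-sample correction:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(2000)

# Split once: one part to fit the model, one part to calibrate.
X_train, y_train = X[:1000], y[:1000]
X_cal, y_cal = X[1000:], y[1000:]

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)  # fitted once and for all

# Conformity scores on the calibration set: magnitudes of the residuals.
scores = np.abs(y_cal - model.predict(X_cal))

# Finite-sample-corrected 90th percentile of the calibration scores.
n = len(scores)
alpha = 0.1
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

# Prediction interval for a new point: model prediction plus or minus q.
x_new = np.array([[4.7]])
pred = model.predict(x_new)[0]
print(pred - q, pred + q)
```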
All right. If you do something like this, you're going to get coverage. Here we see an artificial data set: we have an x value and a y value, and you can see that the data set has a bit of geometry — it's not homoscedastic, and everything. (I don't know why it doesn't come out so well on the screen.)

So what you can do is fit the mean function, which is what I proposed to do in the previous slides: fit a model to discover the conditional mean of y given x, then compute residuals and get prediction intervals. But think about what we're doing: we fit a model, we compute residuals, and then we build a band around the model of a size determined by the quantiles of, say, holdout residuals. That constructs a predictive band of constant width, which does not really adapt to the geometry of the data.

So instead, what we might think about doing is to rethink the algorithm completely from the ground up, and say: well, one thing I never understood about the work of Vladimir Vovk is, if you want this to deliver prediction intervals, why do you start by estimating the mean? You're clearly after the quantiles of the conditional distribution — why should you start by estimating the mean? Perhaps we should start by estimating the quantiles. So here you see an estimate of the upper quantile of my distribution, and an estimate of the lower quantile of my distribution. But you might ask: I got this by random forests — is this well calibrated? And of course it is not. But we're going to pass this through the same technology we've just seen: we're going to look at our initial guess and calibrate it so that it actually returns exactly 90% coverage.

So how do we do this? We're going to fit a quantile regression; then we're going to look at conformity scores, which are not going to be the magnitudes of the residuals — they're going to be something else — and then we're going to calibrate them. This yields a method that we call conformalized quantile regression (CQR), where we get intervals that adapt to the geometry of the data. They have exactly the same coverage properties as the previous method, but as you can see, the length is very adaptive, where the previous one is not.

Okay, so how did we do this? We did this by extending the framework of conformal prediction a bit, to say: you don't really need to use residuals as conformity scores; you can use any statistic you like. So if you follow what we've been saying: you can pick any conformity score function s(x, y) — before, s(x, y) was, for example, the magnitude of the residual, but you can pick anything you like. You apply a symmetric algorithm to the data points and a hypothesized value y, you get conformity scores, you look at the 90th percentile — that's the number q — and you include y if the conformity score at the test value y is in the bottom 90 percent. By exactly the same argument as before, it's going to work: the coverage is again exactly 90 percent.

So how did we get this interval over here? Well, I'm going to estimate a lower quantile and an upper quantile. They may be miscalibrated, but I'm going to recalibrate them by use of a clever conformity score, and the conformity score in this example is essentially the signed distance of a data point to its nearest estimated quantile: it has a negative sign if you're inside the region, and a positive sign if you're outside of the region.
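In symbols — my transcription of the score just described — if q-hat_lo and q-hat_hi are the fitted lower and upper quantile estimates, the CQR conformity score of a point is

```latex
% CQR conformity score: signed distance to the nearest fitted quantile
% (negative inside the band, positive outside).
E_i = \max\bigl\{\, \hat{q}_{\mathrm{lo}}(X_i) - Y_i,\; Y_i - \hat{q}_{\mathrm{hi}}(X_i) \,\bigr\}
```

and the calibrated interval shifts both quantile estimates outward by Q, the 90th percentile of the calibration scores: C-hat(x) = [q-hat_lo(x) − Q, q-hat_hi(x) + Q].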
So what you're going to do is use exactly the strategy we talked about: you include a data point if its conformity score falls below the 90th percentile of the calibration scores. In this case I did the math for you, and it means that you're going to move your quantile estimates up and down by an amount q, which is the quantile of the conformity scores. For example, if you did a very good job and you were well calibrated to start with, you will not touch them. If you were a bit too optimistic — that is, the coverage was too low — you would enlarge your prediction band; and if you were too conservative, you would actually shrink it. And no matter what, you get exactly 90% coverage.

All right, so we can apply this to any data set you like. Here's a data set where the goal is to predict medical expenditure from a number of variables, such as your age, your marital status, your race, your poverty status, your functional limitations, the kind of insurance you have, and so on and so forth. It's a fairly high-dimensional data set: we have about 140 features and 16,000 subjects, and you measure healthcare utilization, as measured by the number of visits to a doctor's office or hospital and so on.

What you get is very boring box plots — and I love boring box plots, because what they show is that if you want 90% coverage, no matter whether you use the residual method I showed you first or the more sophisticated CQR method, they both give you 90% coverage. You want 90, you get exactly 90. Now, the advantage is that if we look at the lengths of the prediction intervals, then which conformity score you're using starts to matter a lot. And what we can see, at least on this example, is that if you use CQR, because it's more adaptive, you get intervals that are more informative — they are shorter. Not only this, but with CQR — although the theory cannot guarantee conditional coverage, that is, coverage given the value of x, which cannot be done in a model-free fashion — we can still test for conditional coverage on empirical data. And the empirical conditional coverage that the CQR algorithm achieves is very close to the nominal level: the residual-quantile method tends to over-cover when you have reduced volatility, while the CQR method is a bit more adaptive. So it seems to be a better algorithm; that's sort of what people use nowadays, and that's what the Washington Post uses.

Now you'd say: well, can you move beyond quantitative outputs? And the answer is yes. Let's say that now I want to do something a bit different, where maybe my label is an object class, for example. Okay, I'm going to explain how we did this. Let's start with my example first: we have objects, and we try to recover objects in images, and we're going to see a conformal method that does just this. So you have three images, and here is what the conformalizer says. You look at the first image, and the conformalizer says: I'm pretty sure that it's a fox squirrel.
That is, my 90 percent predictive set has only one element, and it's a fox squirrel. Now we can move to the second image — and I hope you can see the second image. The conformalizer says: well, I'm a bit less sure now; in my bag I'm going to put a fox squirrel, a gray fox — for all we know this might be a gray fox — a bucket, and a rain barrel. And by the way, there is a bucket and a rain barrel in the picture. So the conformalizer says: I'm less certain about what this is. Finally, for the image on the right, the predictive set has cardinality six now, and it includes a marmot, a fox squirrel, a mink, a weasel, a beaver, and a polecat. And we have thousands and thousands of images like this, and each time we have a predictive set; if I were to evaluate the probability with which the true label is in the set, it would be exactly 90 percent.

So how did we get these predictive sets? Using a simple method which has exactly the flavor of what you've seen before — it takes a while to realize it, but it is exactly the same thing. For example, I could train a neural net to predict the class label given the features, given the pixel intensities. So I have a probability pi-hat of being of class y given the image I've seen; this could be the output of a softmax layer in a neural net. An uncalibrated guess might be something like this: I look at my neural net and it says there's a 50% chance of being class a, a 30% chance of being class b, and so on — and believe me, when you fit a neural net, these probabilities are really uncalibrated, but we don't know that yet. A naive approach would be to say: well, I'm going to believe my neural net, and I'm going to put together the most likely labels until I reach 90 percent. Of course, this is not calibrated.

So what would you do instead? You conformalize this. The way you conformalize it is to say: I'm going to have a free parameter here, which is how much mass I should include in my predictive set so that I get 90 percent coverage. You can think about a thought experiment like this: should I include the top 90 percent of the mass? The top 95 percent? The top 98 percent? The top 85 percent? Who knows? So I'm going to conformalize this: on the x-axis you put the fraction of mass that you keep, and on your holdout set you check how much coverage you achieve. If you want 90 percent, you might realize that you need to use the top 95 percent of the mass — if this is the plot you see, you're going to choose 95 percent to get 90 percent coverage on future data points. This is exactly of the form you were seeing before — it's not obvious why, but it can be cast in the form we've seen before — and once you can create a prediction set in this fashion, you have exactly the coverage you like.

And so you can get stuff like this — I don't know if you can see this — where we go through video sequences; the camera sees a lot of things, and each time we say: here's what you see, and we give you prediction sets that are correct 90 or 95 percent of the time, whatever you choose. So the car knows, with some form of uncertainty quantification, what's in front of it.
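Here is a minimal sketch of this calibration for classification. I use the cumulative softmax mass needed to reach the true class as the score — one standard way to implement the "how much mass should I keep" idea described above; the toy model outputs are placeholders of mine:

```python
import numpy as np

def calibrate_mass_threshold(probs_cal, labels_cal, alpha=0.1):
    """For each calibration point: how much softmax mass (taken from the most
    likely class downward) is needed before the true class is included?
    The calibrated threshold is the corrected 90th percentile of these scores."""
    n = len(labels_cal)
    scores = np.empty(n)
    for i in range(n):
        order = np.argsort(-probs_cal[i])              # classes from most to least likely
        cum = np.cumsum(probs_cal[i][order])
        rank = np.where(order == labels_cal[i])[0][0]  # position of the true class
        scores[i] = cum[rank]                          # mass needed to cover the true class
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)

def prediction_set(probs, threshold):
    """Include the most likely classes until their cumulative mass reaches the threshold."""
    order = np.argsort(-probs)
    cum = np.cumsum(probs[order])
    k = np.searchsorted(cum, threshold) + 1            # number of classes to keep
    return order[:k]

# Toy usage: pretend these are softmax outputs of a classifier on a calibration set.
rng = np.random.default_rng(2)
probs_cal = rng.dirichlet(np.ones(10), size=500)
labels_cal = np.array([rng.choice(10, p=p) for p in probs_cal])
t = calibrate_mass_threshold(probs_cal, labels_cal)
print(prediction_set(probs_cal[0], t))
```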
So these methods have real applications. On my last slide — I don't know whether I'll have time to reach my last slide — you'll see that I was surprised to find that a lot of companies are actually using this very intensively now, and in fact they release software for the AI community to do uncertainty quantification using these ideas. I'll show you the slide at the end.

Okay. So now you have really many, many ways of assessing uncertainty: you can use full conformal, split conformal, the jackknife+; you have a choice of conformity scores, and so on. You have an array of possibilities — a framework with which you can do stuff. And all these methods achieve the prescribed coverage: they all achieve 90 percent. So now you have a basis for comparing methods. Maybe a basis is to say: the method I prefer is the one that gives me the shortest prediction intervals. And if we were to compare in that fashion on these data sets, which are very typical data sets used in the machine learning community, you'd see that it's a good idea to use the jackknife+ — for example, on Fashion-MNIST with a neural net implementation. But this is now your choice: you have a battery of things you can implement, and then you can judge them. They all achieve what you want, and now you have a basis for comparison.

All right, so let's return to election night. Here is my introductory slide again; we're going to do a thought experiment, and the thought experiment is this. We're going to have a training set: I'm going to draw counties at random, 1,200 of them — these are the green counties — and I'm going to have to predict the white counties, 1,900 of them. By construction, my data set is exchangeable; it's like the urn model. I have the counties in an urn, I draw counties at random, and by definition I have exchangeability, so the theory applies. And I'm always in awe of these numbers, because the theory applies and it is really good, in the sense that in my first draw of 1,200 counties, I evaluated coverage, and whether I use the residual method or CQR, I get exactly 90. I did this 25 times, and here are my box plots; it really works.

Now, is this what happens in practice? No — because during election night, we are not drawing counties at random from an urn. There are some counties that are more likely to report early; these are the less populous counties, on the eastern side of the United States. So the urn model does not apply. And here I'm going to show you an experiment where, instead of drawing counties at random, I break exchangeability — and I break it by selecting for my training set all the counties in the eastern time zone of the United States. These are all the green counties you see here, and I'm going to use the outcome of the election in those counties to predict all the other time zones in the United States. And that doesn't work so well. It's not catastrophic, but that doesn't work so well.
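You can reproduce the spirit of this thought experiment on synthetic data. The sketch below — entirely my construction, not the election data — checks split-conformal coverage under an exchangeable random split, and then under a geographic-style covariate split that breaks exchangeability:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def coverage(X_cal, y_cal, X_test, y_test, model, alpha=0.1):
    """Empirical coverage of split-conformal intervals calibrated on (X_cal, y_cal)."""
    scores = np.abs(y_cal - model.predict(X_cal))
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.mean(np.abs(y_test - model.predict(X_test)) <= q)

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(3000, 1))
y = np.sin(X[:, 0]) + (0.2 + 0.1 * X[:, 0]) * rng.standard_normal(3000)  # noise grows with x

model = RandomForestRegressor(random_state=0).fit(X[:1000], y[:1000])

# Exchangeable: calibration and test drawn at random from the same pool ("urn model").
idx = rng.permutation(np.arange(1000, 3000))
cal, test = idx[:1000], idx[1000:]
print("exchangeable:", coverage(X[cal], y[cal], X[test], y[test], model))

# Broken exchangeability: calibrate on small x ("east"), test on large x ("west").
rest = np.arange(1000, 3000)
east, west = rest[X[rest, 0] < 5], rest[X[rest, 0] >= 5]
print("shifted:", coverage(X[east], y[east], X[west], y[west], model))
```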
That's not what we want. I want 90% coverage; the first method achieves only 74, the second one 78. That doesn't work. And I want to say that this is a proof of non-exchangeability: if the counties were exchangeable, you would get 90. The fact that you don't get 90 is a proof that counties in the United States are not exchangeable. So this is a certificate, and we need to do something about it.

And this is, of course, a problem that is extremely important, because: can I always assume that the data I've collected so far is representative of the data I'm about to see? This is perhaps the theme of this conference: I have a training set, I have the data I'm yet to observe, and maybe there's a distribution shift between what I've seen and what I'm about to see. Okay, so we're going to start talking about that. I don't know whether I'll have time to go through both parts — maybe I'll go through the first one; the second one is interesting as well — but let me give you some ideas about how to deal with distribution shifts, and it's probably going to be different from a lot of what you are doing. This is work with Isaac Gibbs.

To go back to conformal prediction: if I have i.i.d. assumptions, then the histogram of the conformity scores I've seen so far and of the conformity scores I'm about to see are perfectly aligned. But if I have a distribution shift, that's no longer the case. For example, I might actually do finance, and it might be that I'm entering a period of low volatility after high volatility, in which case the errors that the model makes are not stationary anymore. So what I have is, in orange, the true distribution of the conformity scores I'm about to see, and in purple the historical ones. And now remember, we're doing things in very high dimension — x can have tens of thousands of covariates — and I am not willing to model this distribution shift in a distributional manner.

So the thing we're going to try to do is very simple — and I apologize if it's very simple, but it's very simple. The true distribution may be shifted to the left; it may be shifted to the right; I don't know. But in any case, perhaps I should not use the 90th percentile of the purple distribution, the one I have. If it's shifted to the left — and I can try to detect that it's shifted to the left — maybe I can use the 85th percentile; if it's shifted to the right, maybe I should use the 95th percentile. So we're going to still use the same algorithm, but I'm not going to take the 90th percentile of the conformity scores; I'm going to try to track the percentile that I should be taking. And the key idea is this: what if I knew to which quantile of the purple distribution the 90th percentile of the red distribution corresponds?
If I knew that, I would apply the right quantile of the purple distribution and be perfectly fine. So we're going to try to learn the quantile of the observed (purple) distribution that we should use. It sounds preposterous, but you'll see that it does something.

So — this is probably the most simple equation I've ever written in my life. We're going to try to track the quantile that we should use, the quantile of the purple distribution, and we're going to track it through a form of online learning, for those of you who know what online learning is; it's also a form of control. What I'm going to do is say: you should use alpha_{t+1}, where alpha_{t+1} = alpha_t + gamma * (alpha - err_t); gamma is a step size, and err_t is the error indicator. I just constructed a prediction set using alpha_t: did it include the truth or did it not? If it did — that is, I covered — then probably I'm using a quantile that is a bit too large, so I'm going to reduce it. If I made a mistake, my intervals are too short; I need to lengthen them a little bit. That's the interpretation. So you see, it's extraordinarily simple: I look at my errors, and based on the error process I see, I react and shift my quantiles left or right. Now, this can be understood as a form of online gradient descent — I'll explain this in a minute — and I can use this insight to select what people in the field would call the learning rate, in a boosting-style fashion. But you see, the argument is very, very simple: let's try to track the quantile of the purple distribution, to find the 90th percentile of the unseen red distribution, by looking at the error rates.
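Here is the update in code — a minimal sketch of the adaptive scheme on a stream of conformity scores; the sliding window and the drifting toy scores are my choices, not from the talk:

```python
import numpy as np

def adaptive_quantile_tracking(scores, alpha=0.1, gamma=0.02, window=500):
    """Track the quantile online: after each point, update the working
    miscoverage level via  alpha_{t+1} = alpha_t + gamma * (alpha - err_t).
    A miss (err_t = 1) lowers alpha_t, widening future sets; a cover raises it."""
    alpha_t = alpha
    errs = []
    for t in range(window, len(scores)):
        q = np.quantile(scores[t - window:t], 1 - alpha_t)  # working quantile of recent scores
        err = float(scores[t] > q)                          # 1 if the new point falls outside
        errs.append(err)
        alpha_t = np.clip(alpha_t + gamma * (alpha - err), 0.0, 1.0)
    return np.mean(errs)

# Toy stream of conformity scores whose scale drifts upward over time.
rng = np.random.default_rng(4)
scores = np.abs(rng.standard_normal(5000)) * np.linspace(1.0, 3.0, 5000)
print("long-run miscoverage:", adaptive_quantile_tracking(scores))
```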
Now, if I do this for the election, this is what I see. I've ordered the counties from east to west, and I'm looking at the prediction error averaged over 300 consecutive counties. The red curve is the moving average if I don't do anything — if I use the method you've seen before, where I always use the 90th percentile of the conformity scores. And what you see is that as you enter the Midwest, nothing works anymore: you lose coverage. Why is that? Because the Midwest does not look like the East Coast. Then you get out of the Midwest, and all of a sudden you overshoot, and then you undershoot when you reach the Pacific coast, and so on. If instead you track the quantile in an adaptive fashion, you get the blue curve, and the blue curve seems to hug the target of 90 percent much more closely. In fact, I would argue that it hugs it so well that it's almost indistinguishable, at least statistically speaking, from what I'm going to call the gold standard. What is the gold standard here? At any given time, I give you the true quantiles; you report the true quantiles, and at any given time you have a probability of 90% of being in the set. If you were to do this, you would get the gray curve you see here. So this is the gold standard: I know everything about the world, I know the quantiles, and this is the error rate you would suffer. And what we can see is that the excursions of the blue curve are comparable to the excursions of the gray curve, meaning that you do something almost as good as if you had an oracle.

We can do stock market stuff. So again, here we try to predict the volatility of stock returns through an enormous period of time that covers the 2008 crisis. What we can see is that if we do not adapt conformal prediction, we get the red curve. This is Fannie Mae — Fannie Mae had enormous problems in 2008 — and you can see that if you do not adapt to a changing world, you're going to dramatically under-cover. But if you apply the method that we just described, you hug the 90 percent line, roughly on par with the gold standard. And you see different stocks — AMD, Nvidia, BlackBerry; BlackBerry, of course, also went under in 2008 — and it's the same thing: if you don't adapt, you do terribly, but if you track the quantiles, you do much better.

Okay. One result that is pretty cool — and actually an easy result — is this: you now make no assumption about anything; there is no assumption of any kind. If you apply the simple method that I just showed, then long term you will get exactly the right coverage. If you want 90%, you'll get 90%: on average over time, this method gives you 90% coverage no matter what. And what's a bit surprising about this result is that there's no assumption of any kind — things can be deterministic for all I know.

Now, there's a connection to online learning, and the connection is this: you can interpret the tracking of the quantile as a form of gradient descent. You'd say: a gradient descent applied to what? Well, a gradient descent applied to a particular loss function, which is a pinball loss, and a particular random variable, which we're going to call u_t. The pinball loss is the kind of loss function you use when you fit quantiles, and it's represented on this slide. And u_t essentially asks: what is the largest miscoverage level at which you would still have contained the observation? It's a random variable that you cannot observe directly; you ask for the smallest confidence level 1 - u_t at which you include the variable y in your prediction set. It's a random variable because y is a random variable. With this random variable u_t, you can rewrite the update as gradient descent on this loss function: basically, what you're trying to do is track the quantiles of this variable u.
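To spell the connection out in symbols — my reconstruction of what the slide says — take the pinball loss at level alpha, evaluated at the unobservable u_t:

```latex
% Pinball (quantile) loss at level alpha, evaluated at u_t:
\rho_\alpha(\theta; u_t) =
\begin{cases}
\alpha\,(u_t - \theta), & \theta \le u_t \quad (\text{covered: } \mathrm{err}_t = 0),\\
(1-\alpha)\,(\theta - u_t), & \theta > u_t \quad (\text{miss: } \mathrm{err}_t = 1),
\end{cases}
\qquad
\frac{\partial}{\partial \theta}\,\rho_\alpha(\theta; u_t) = \mathrm{err}_t - \alpha .
```

Since the gradient is err_t - alpha, the update alpha_{t+1} = alpha_t - gamma * (err_t - alpha) = alpha_t + gamma * (alpha - err_t) is exactly an online gradient step on this loss: tracking the quantile of u_t is literally gradient descent.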
Okay. Now that you have formulated this as an online gradient descent algorithm, there's one thing we have not discussed: how fast should you react? What should be the value of the step size? With the model in terms of online gradient descent, you can say: well, I don't have to make a choice. I can have several agents, each with their own view of what gamma should be, and what I'm going to do is follow the soft leader. I have several candidates — some who react fast, some who react slowly — and I'm going to look at how well they're doing as far as these losses are concerned, how well they're tracking the quantile of these u random variables. So I have a loss function, which is simply the pinball loss evaluated at this random variable u, and what I'm going to do is basically follow the person who is doing well — who has the smallest loss — or a soft version of this. And now I don't have to choose gamma anymore; I can just do this in a boosting style: I have a bunch of agents, they compete against each other, and depending on the time, I follow the person who happens to be doing best at that time.

When you do this, things work well again. This is now a more modern application, where I'm trying to predict trajectories of COVID-19 case counts from earlier counts — counts from two weeks before, stuff that I see on Facebook, and things like that. The blue curve is this adaptive algorithm, with an automatic update of the step size according to this boosting-style iteration. We have several counties here: we have San Francisco, we have Miami, we have New York, we have Dallas — and I don't know if you go to the United States, but I can tell you that San Francisco did not behave like Miami when it came to the COVID crisis. And yet the algorithm doesn't really care: it adjusts, and it issues predictions that are always close to the 90% target over time, even though we experienced tremendous changes in behavior during these time periods.

All right, I think my time is up. I had a last part, but I don't think we need to go through it, so I'll skip it and just say this. The takeaway message is that there is an enormous effort by people in academia and industry to build uncertainty quantification around machine learning systems. I really love this framework, because it forces you to be honest about what you know and about what you don't know: if the prediction intervals are wide, you have not learned that much, and you have to be honest about that. My estimate is that there are between two and three thousand papers on this subject published each year now, so it's a big subject, and there's an explosion of interest in industry. I don't know if you can see my slide — this is the slide I promised: AWS, which is Amazon Web Services, is actually offering software to implement a lot of the things you've seen, so that people can safely quantify the reliability of their AI systems. A lot of what we did assumed that the data is exchangeable, but that's not always the case; we have to worry about distribution drifts, and I just showed you a little way of dealing with this. There's a part I had to skip, but since I'm over time, I thank you for your attention.