for the invitation. And I just wanted to correct you: I think we actually met in 1999 or 2000, when I was a PhD student, though you may not remember. In Brighton, I think; I was with Keith in Brighton. Okay, well, thank you for the introduction and thanks for coming today. Today I'm going to talk about some work I've been doing, a talk I've been giving in various statistics departments around the country. It's gotten a few puzzled looks there, because it's a bit of a different idea about inference, and today I'm going to try to give a spin on that work as it relates to some quasi-practical issues in data science. You can find the notebook for the slides for this talk at that link there.

So I'm going to talk about what you might call a prototypical applied statistician's daily task: given a data set, come up with a predictive model, maybe an interpretable one, and maybe give some sort of traditional report like a confidence interval or p-value. But with the twist that I'm going to actively recognize that we don't typically fit one fixed model: we have some algorithm that we're going to use to choose which model to report. I'm going to use the lasso, which I presume most people have heard of. I'm not going to spend much time on the lasso today, and I won't show you the typical picture of the lasso, but it's the algorithm I'm going to use, and if there are specific questions I'm happy to answer them. I'm going to point out the issues you might worry about if you use something like the lasso to choose a model, and explain why a data scientist should care about them. Then I'll explain a little about what I call selective inference, or the field called selective inference, and how it relates to this problem. After that I'll give a few more examples of a slightly more general framework for selective inference. Since this isn't a pure statistics audience I'm not going to go into great detail, but I'll give at least one example of a general framework that's actually somewhat familiar; we'll see when we get there. And I'm going to talk about some more recent work where I randomize my response vector before doing inference. Okay, enough about the outline.

First, there are a few papers mentioned there. I've been doing a lot of work on selective inference in the last few years, and these are some of my co-authors; what I'm going to talk about today is based on joint work with many of them. It's a long list, and there are others I probably should add that I haven't added yet.

Okay, so here's the running example. We're going to look at in vitro measurements of drug resistance in an HIV virus population. There's a doctor at Stanford who runs HIVDB, a database that stores information from patients in studies around the world. Some of the information it carries, for the virus in a particular patient's blood, is the sequence for that patient. And of course there's a lot of variability in the HIV virus, so different patients have slightly different patterns of mutations on a specific part of their viruses, and we're trying to understand how those mutations affect drug resistance.
Okay, so this is, and I always forget whether this is an NRTI or a protease inhibitor, one particular drug: the NRTI 3TC, a drug that's on the market. We have 633 cases, so 633 patients' viruses. For each of these viruses we have a measurement of resistance, and we also have its mutation pattern. I've taken 91 different mutations; these are all the mutations that occurred more than 10 times in the data set, and they're site-specific, amino-acid-specific mutations. I'll show you a little of the data in just a second. My goal is to build an interpretable, predictive model with this data set, and to give some notion of how accurately we can estimate the coefficients in that model, or some statistical assessment of the importance of the variables that are in it.

Okay, so here's my design matrix. I'm doing regression: I have 633 cases and 91 columns in my design matrix. What do the features look like? Well, each is associated with a position on the reverse transcriptase, which is a part of the HIV virus, and with an amino acid, telling which mutation from wild type to which amino acid we have. So these are the kind of features I have, and I have 91 of them. These are the different features that showed up: at position 43, say, there were at least 10 mutations to this amino acid, N here. And I don't know what N is, but I hope you'll forgive me.

Okay, so now enter the applied statistician, or maybe enter the data scientist, I'm not sure what term to use here. At this point I just have a design matrix and my response, which is also 633 long, and I don't necessarily have a model in mind. That is, I don't have a traditional statistical model that I want to use at this point. I have 91 mutations; they're probably not all important. I probably want to present my PI with some nice summary that doesn't have too many variables, so I want to get rid of some of these mutations and use a smaller model. The point is that right now I don't have a model, so I can't just use traditional inference for a specific model, because I have to find my model first. And I would say, more and more, this is the norm in modern science, or data science. We're often collecting data that's much, much bigger than this data set. I'll acknowledge this is a toy data set; I chose it because it's small and concrete, and there are lots of data sets of this size out there. It's not big data. But there are lots of examples where we're collecting huge amounts of data, and we don't necessarily know what questions we want to ask about the data to begin with. We don't know what kind of model we want to fit when we collect the data; we're going to use some machine learning or applied statistics algorithm to choose a model. So we're going to explore the data first, then we'd like to summarize our results, present them to the PI, and give some confirmation of the validity of the model that we've found. And this is the conflict in the way I've described model building: classically, if you follow a year of statistics courses, you'll learn lots of things about hypothesis tests and confidence intervals. That's the frequentist theory; there's a Bayesian version as well.
But all of that theory really assumes that you had a particular question in mind before you collected the data. And I'd argue that most of the time now we're collecting data before we have a question to ask. So we can't really use these traditional confirmatory analyses, because we're violating the assumption that we had the question before we looked at the data. And of course we're all aware that if you look at the data before you decide what hypothesis to test or what confidence interval to construct, there are many pitfalls. I'm going to go through some examples and, in some sense, remind you of what you already know.

So you might say what we're doing is captured by a relatively common saying in statistics: if you torture the data enough, nature will always confess. That's maybe a generous reading of the quote; it sounds like we will find the truth if we use a sophisticated enough algorithm. There's another version of the quote, due to someone else, that says if you torture the data sufficiently, it will confess to almost anything, and that's probably the more realistic one. On the slide this is attributed to Fred Menger, and that must be a mistake; that's not Fred Menger's quote, this is just a picture of Fred Menger, a chemist at Emory. Okay. So I think in many instances we're really encouraged to torture the data somewhat, to get some idea of which questions might be interesting to ask before we actually ask them. And we should be wary of this saying: of course the data might confess, but we need to know whether the confession has any actionable information in it. Is there anything still there, even though it's confessed to something? That, in a nutshell, is what I'm trying to provide in this framework.

Okay. Some would say, why should I worry about this? We have all these neat algorithms that extract information from data sets, that produce pretty cool-looking pictures, that tell a nice data story. And that's important in and of itself: we've extracted some information from the data. But that information isn't really ready for traditional confirmatory analysis. We can't go to a journal and say, here's a p-value for this word cloud of coefficients, or for what's important in predicting something in my model. Or at least, using traditional methods, we can't report things like a p-value or a confidence interval. Many people would say maybe you don't want to report a p-value anyway; there's a sizeable movement in the applied sciences to get rid of NHST, null hypothesis statistical testing, and they say, we want you to report confidence intervals instead. That doesn't quite absolve you, because if the p-values are wrong, the confidence intervals are going to be wrong too. So even if you don't like p-values, if you want to report confidence intervals you should still be worried about torturing your data. And I think many parts of the scientific community still want p-values reported for particular questions, particular hypotheses, and if not, they at least want confidence intervals that give some idea of the accuracy of the estimates you might report in your paper.
So besides extracting cool information with nice visualizations, we need something that goes back and fits, at least somewhat, the traditional scientific method of producing a confidence interval or a p-value. That, again, is what I'm trying to do, and I will get to it eventually.

Okay, now let's be a little more specific. Let's talk about torturing our data and two big enablers in this world. These are two of my colleagues at Stanford, Rob Tibshirani and Trevor Hastie; I should have put up a picture of their book, The Elements of Statistical Learning. At least in the statistics community, they have been great proponents of using complex models to fit regressions, at least in the regression context I'm talking about. The most common one is the algorithm called the lasso, and the elastic net is a variant of it. I'm not going to go into great detail, but the lasso and the elastic net are optimization problems. As before, y is my response vector, 633 long, X is my design matrix, 633 by 91, and we find beta by something like least squares, but least squares modified by a convex penalty. Then there are these two tuning parameters, which are ubiquitous in complex models, and we have to choose them somehow, maybe by cross-validation. So, in a sense, this is an example of how we torture our data: we might have 100 different values of lambda one and 10 different values of lambda two, and we find the combination that does best by cross-validation over that whole grid. After that we'd like to go back and report something based on the model chosen there, but we don't have the tools, which is where we began. And I should say, this model has two tuning parameters; many others are worse. I wouldn't put myself in the same club as Rob and Trevor, but I had one that had three tuning parameters. So I'm not proposing we get rid of models with tuning parameters; I want to give valid inference after using models that have tuning parameters.

Okay, so what kind of information does the lasso or elastic net give you? Here's an example you might see; it's not necessarily a real scientific example, but it's the kind of information you might extract from a regression problem with the lasso. This is a word cloud where size measures the size of coefficients in a model where we're trying to predict a review score, in this case from the IMDB review database, based on the words in the review, for the sub-genre of horror movies. You see words like 'blob' and other scary ones; those are probably important words. And in many cases, I must confess, this is sort of the end of the road for these methods, right? We have a data set, we use some rich algorithm to find a model, we find coefficients, here we can see the size of the coefficients, and we stare at them for a little while. And that's kind of the end of the road, at least in my experience listening to applied talks that use the lasso.
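To make the "grid of tuning parameters" concrete, here is a minimal sketch, in Python with scikit-learn, of fitting an elastic net and choosing both penalties by cross-validation over a grid. The synthetic data, the grid sizes, and the mapping onto scikit-learn's (alpha, l1_ratio) parameterization are my stand-ins, not the analysis behind the slides.

```python
# A minimal sketch of the "data torture" just described: an elastic net with
# two tuning parameters, tuned by cross-validation over a grid.  The data are
# synthetic placeholders mimicking the 633 x 91 dimensions of the HIV example.
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(0)
n, p = 633, 91
X = rng.binomial(1, 0.1, size=(n, p)).astype(float)          # stand-in mutation indicators
y = X[:, :5] @ rng.normal(2.0, 1.0, size=5) + rng.normal(size=n)

# Cross-validate over a 100 x 10 grid of penalty settings, as in the talk.
enet = ElasticNetCV(l1_ratio=np.linspace(0.1, 1.0, 10),      # 10 "lambda_2-like" values
                    n_alphas=100,                             # 100 "lambda_1-like" values
                    cv=5)
enet.fit(X, y)
print("chosen penalties:", enet.alpha_, enet.l1_ratio_)
print("nonzero coefficients:", np.flatnonzero(enet.coef_))
```

After this grid search, the selected nonzero coefficients are exactly the kind of data-dependent model the rest of the talk is worried about reporting on.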
This reminds me of something Matthew mentioned earlier about brain imaging. Something that was commonly done in brain imaging was to threshold an activation map. There the features would not be words but blobs in the brain, and people would stare at the blobs, try to make a story, and write a paper about it. So this is an analogous kind of picture. Now, when we look at this picture, we see some coefficients are large and some are small; there's some information in the data set. But of all the information we've extracted here, how much is actually valuable? If I'm thinking of this as torturing the data, how much of the intelligence I gathered by torture is actually actionable? What is really left there? In principle we don't know, so I'm going to try to give you a way to assess how much is valuable after you've extracted it from the data.

Okay, so I hope all of you will agree that when we look at this information we've gotten from the data set, we can't just use naive inference along with these algorithms. What do I mean by naive inference? Well, once I've chosen a subset of variables with the lasso or something like that, that's a usual regression model, and I could use any statistical software out there to fit it and report a p-value or a confidence interval for those coefficients, and I could use that as my report. But will that have any statistical validity? Most of us should hopefully say: generally speaking, no. So let's take a look at how bad it can be.

Here's a simple example with synthetic data, not the data I started with. I have a 100 by 200 feature matrix, so it's not big data, just medium-sized data, or maybe small data these days. And I'm going to generate data from a model for which the lasso is appropriate: a response vector y whose distribution given the design matrix is normal with mean X beta and independent errors with variance sigma squared. I'll solve the lasso at some value of lambda, and I'm going to look at two different scenarios. In the first, beta is zero. That's an example where I would have collected a data set in which my response really has nothing to do with the design. I don't know how many data sets like that there are out there in the world, but there certainly are some on my computer, because I can make them. So we'll see what happens under the global null. In the second scenario, I look at an alternative where beta has seven non-zero coefficients; seven isn't chosen for any particular reason, and they all have value seven, which also isn't chosen for any particular reason. It's the default setting in the code I wrote when I generated this data set; I wrote the code, so I put seven there.

Okay, so now what happens? We fit the lasso. In scenario one, every model, every subset of variables I get from the lasso when beta is really zero, is formally correct, right? Maybe I include 10 variables, but they all have zero coefficients in truth, so that model is still correct.
Then if I got 10 variables and used a regular statistical package, I'd have 10 different p-values testing whether those coefficients are zero, and I could ask: do they look like valid p-values? Can I use them to form a reasonable test? In scenario two, there are seven true non-zeros, and because I generated the data, I know which they are. Suppose I got 10 variables and captured at least the seven true ones; then there are three coefficients whose true value is zero in this model. Again, using traditional statistical software, I could produce p-values for those three coefficients, and I could ask whether it looks like I can form a reasonable statistical test using them.

Okay, let's take a look at the two scenarios. This is scenario one; the axes aren't labeled very well, but this is the distribution of the p-values. If I had properly calibrated p-values, this plot would lie along the diagonal. Instead, we see that these p-values are much smaller than uniform, and at 5%, which unfortunately we can't quite see, there's about a 60% chance of declaring something significant. So if I threshold these p-values at 5%, my type one error is about 60%, and this is when the truth is completely zero. Now the alternative scenario. In this case, it turns out that at 5% it's somewhat conservative. Again, these are all the coefficients whose true value is zero, and I'm reporting the usual p-value from standard statistical software. Oddly, it's conservative here: much less than 60%, maybe 3%, if I threshold at 5%. But in reality, of course, I don't know whether the global null is true, or the seven-sparse alternative, or one-sparse or two-sparse; I don't know what the truth is, so I don't know what this picture is going to look like. In order to make usable inferences, I need some reference: I need to produce a test whose distribution is going to lie on that diagonal. That's the goal here.

Okay, here's another example, the same setup as before but just under the global null. Instead of the lasso, I'm going to run forward stepwise, and I'm just going to do one step: I take the best single-variable model. Because the global null is true and all the coefficients are zero, I would hope that the p-value I produce is uniform, because there's nothing happening, no relationship between y and X. Well, let's see what happens. In this case the type one error is about 98 percent or so; 5 percent is somewhere down here, and we're up at 98 percent. So what do these examples illustrate? They illustrate that you can't just use the usual p-values your statistical software provides for the model you found using something like the lasso or forward stepwise, and that the shape of these curves will depend in general on the algorithm you use.

Question. So here I have 200 features and I'm taking the best one-feature model; that's the feature that's most correlated with y.
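Here is a short sketch of scenario one as I understand it: a 100 by 200 Gaussian design under the global null, fit the lasso at a fixed lambda, refit ordinary least squares on whatever was selected, and collect the naive p-values. The particular lambda, the column scaling, and the use of statsmodels are my choices for illustration, not necessarily those behind the plots.

```python
# Sketch of "scenario one": global null, lasso selection, then naive OLS
# p-values for the selected variables.  Under proper calibration roughly 5%
# of these p-values would fall below 0.05; after selection, far more do.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
n, p, sigma = 100, 200, 1.0
naive_pvals = []
for _ in range(500):
    X = rng.normal(size=(n, p))
    X /= np.sqrt((X ** 2).sum(0))            # unit-norm columns
    y = rng.normal(scale=sigma, size=n)      # global null: beta = 0
    fit = Lasso(alpha=0.02, fit_intercept=False).fit(X, y)
    active = np.flatnonzero(fit.coef_)
    if active.size == 0:
        continue
    ols = sm.OLS(y, X[:, active]).fit()      # naive refit on the selected columns
    naive_pvals.extend(ols.pvalues)

naive_pvals = np.array(naive_pvals)
print("fraction below 0.05:", (naive_pvals < 0.05).mean())
```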
And for that there's a regression model with one predictor in it that I could have used R or SAS or statsmodels or whatever to fit, and it will give me a p-value, right? It produces the z-statistic and looks up where that observed z-statistic falls in the z distribution. If I had not looked at the data to choose which variable to use, that p-value would be uniformly distributed and this plot would lie along the line here. But it clearly doesn't. So what's wrong here is using this traditional z-statistic and converting it into a p-value; that's what's breaking. It's not giving us control of the type one error. Okay, are we making sense so far? In these examples, the points I'm making are, (a) that the tools we were taught in regression courses are not applicable after using an algorithm to choose a model, and (b) that the way these curves look depends on the algorithm we choose and on the truth.

Yes. So it's because there are actually 200 different z-statistics, right? For each one-feature model there's a z-statistic that basically tests whether that correlation is zero or not, and what I'm looking at is the largest z-statistic in absolute value, and that doesn't have the distribution of a single z-statistic. That's where the problem is. Does that make sense? Because of the way I chose it with my algorithm, the thing I'm reporting is a function of the largest absolute z, and that obviously doesn't have the right distribution.

Yes? Yeah, so in that case, if I had 10 variables, the 10 had to include the true seven, and then I looked at the three that weren't among those seven. Normally, if I had chosen that subset of 10 variables beforehand, those three p-values would be uniform, right? Because I would have the seven true ones and three zero coefficients in my model, and those p-values would be uniformly distributed. So I've conditioned, in this alternative, on instances of the lasso where it actually found those seven, and then looked at the null ones among the rest.

Okay. So the output of these traditional tools is going to be affected by the algorithm used. If we want to do confirmatory analysis, should we throw out all of these cool algorithms that give us nice interpretable models when we have many features? The point of my talk today is: no, we really shouldn't throw out these algorithms, but we somehow have to address this conflict between the exploratory part of the analysis and the confirmatory part. The exploratory part is the part where I choose a model, and after having chosen the model, that's when I want to do confirmation. We have to somehow resolve this issue.

The term exploratory data analysis is, I think, due to Tukey; this is a picture of Tukey. In his descriptions of his views on exploratory data analysis he really was a proponent of these cool algorithms, though I think he described exploratory data analysis as graph paper and a pencil. He was a little before his time; I don't think he coined the term data science.
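The point about the largest absolute z-statistic is easy to check numerically. Below is a tiny illustration under my own assumptions (unit-norm columns, known sigma equal to one): one step of forward stepwise reports the maximum of 200 |z|-statistics, and converting that maximum with the single-z normal tables gives a wildly inflated type one error.

```python
# One-step forward stepwise under the global null: report the most correlated
# feature and naively convert its z-statistic to a p-value as if it were a
# single standard normal.  The numbers (n, p, lambda of repetitions) are
# illustrative, not the exact settings behind the 98% figure in the talk.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
n, p, reps = 100, 200, 2000
false_rejections = 0
for _ in range(reps):
    X = rng.normal(size=(n, p))
    X /= np.sqrt((X ** 2).sum(0))
    y = rng.normal(size=n)                  # global null
    z = X.T @ y                             # z-statistic for each one-variable model
    zmax = np.abs(z).max()                  # forward stepwise reports this one
    naive_p = 2 * norm.sf(zmax)             # naive single-z conversion
    false_rejections += naive_p < 0.05

print("naive type I error at 5%:", false_rejections / reps)
```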
He did coin "bit" and "software", but I don't think he got to data science; if he had, I don't think he would have used pencil and graph paper to describe exploratory data analysis. But he was a proponent of exploring your data, and he said that at the time there was too much emphasis in statistics on the confirmatory side, on giving guarantees that control the type one error under some scenario or other. He said we should really use the data, explore the data, to suggest hypotheses that we might want to test. And if you think about the way I chose my model before, using the lasso: it chooses some subset of variables, and now I want to make a report, because the output of the lasso algorithm has given me some idea of what might be interesting hypotheses to test. But of course he noticed, as has anyone since him and probably before him, that there's an obvious conflict between using the data to explore and using it for confirmatory analysis, because there will be systematic biases if you apply traditional confirmatory approaches to data you've used to explore. Those were the examples I just showed you: the traditional tools don't work if you've done some data snooping, some exploratory data analysis. This conflict is also reminiscent of the term "researcher degrees of freedom"; there was a paper, I think in 2011 in Psychological Science, on undisclosed choices that researchers make when trying to build a regression model, and it had examples, much like the ones I just showed you, showing that it's very easy to violate your type one error control if you start choosing a model based on the data.

Okay, so what's the solution? I think I've used about half my time, I'm not sure how I'm doing, describing the problem; let me try to describe some of the solution. The solution I'm proposing today comes under the umbrella of selective inference. Selective inference has been around for quite some time; I think we found a reference back in the mid-80s, in a Festschrift for Erich Lehmann. And Yoav Benjamini, of the Benjamini-Hochberg FDR procedure, has also worked on selective inference, though not quite in the context of regression models; those were slightly different and maybe slightly easier problems than regression. One thing I can attribute to him is the observation that, of course, if you apply inference to the selected few, in this case the selected model I got from the lasso, then the interpretation of the usual measures of uncertainty, the usual p-values, does not remain intact unless you properly adjust them. And selective inference is really about properly adjusting these usual measures of uncertainty so that you do have some valid guarantees. In particular, selective inference allows us to test hypotheses suggested by the data: I can use the same data to suggest hypotheses and to do a confirmatory analysis, which, as we saw, has many pitfalls if you do it naively.
In what I'm describing today, the hypotheses are going to be generated by some algorithm; some function of the data determines which hypotheses might be interesting. And while I'm allowed to test hypotheses suggested by the data, there should of course be some constraint, there can't be a free lunch. If you're going to use this approach, you have to declare the algorithm you're going to use to suggest the hypotheses. In my case, I have to declare that I'm using the lasso to choose my regression model. Once I fix that, then I can do inference after the exploratory phase.

So let's go back to our example. This is a plot of the 91 different coefficients. On the x-axis here, in tiny letters you can't read from there, are all the different mutations, and these are their ordinary least squares coefficients. Now, the type of inference I'm describing doesn't allow you to look at this before you decide what to do, not with the lasso anyway; you could look at this and do something, but anyway. We see, as maybe to be expected, that some of these mutations have a high impact on resistance for this drug, and many of them are quite small. This position here is 184V, a well-known mutation in the resistance literature; 65R is another one. I don't really know what these individual mutations imply, but medical researchers working on resistance know that these are well-known mutations with their own stories, which I'm not going to describe because I don't know them. I just wanted to point out that these are particular, identifiable mutations, not just nameless, featureless variables.

Okay, looking at this, it looks like we could probably get rid of many of these variables and still come up with a reasonable model; many of these coefficients are near zero. So why don't I choose an algorithm to choose a model and then report my findings? And as I said, that's how the actual workflow should go: I'm not saying you should look at this picture and then run the lasso. I'm saying I believe, before I collect this data, that a sparse model is probably a reasonably good description of the association between these mutations and resistance to 3TC, and I'm going to use an algorithm to choose a model and then report the findings. I'm not going to look at this plot and then decide on the algorithm. For the purposes of this talk, I just wanted to show you what all of the coefficients look like.

Okay, so we're going to use the lasso to choose the model. The lasso has the property that for certain values of the tuning parameter we get sparse solutions. This isn't a talk about the lasso, but if you've never heard of it, you really should look into it; it's quite useful. It's a convex optimization problem with a tuning parameter that produces sparse solutions for some values of that parameter, and there's a lot of literature on the lasso, some of it giving reasonable choices for the regularization parameter. I'll take a choice from the literature: some multiple of a quantity that I can compute by simulation, and that gives me a fixed value of lambda.
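Here is a sketch of what declaring that selection algorithm might look like in code. I am assuming, as is common in this literature, that the fixed lambda is a multiple of the expected sup-norm of X transpose noise, estimated by simulation, with the noise scale taken from the full least squares fit; the particular multiplier, the sklearn scaling convention, and the function names are placeholders of mine.

```python
# A sketch of the declared selection algorithm: estimate sigma from the full
# least squares fit, set lambda to a multiple of E[ ||X^T eps||_inf ] computed
# by simulation (the multiplier here is a placeholder), solve the lasso at
# that fixed lambda, and record the active set and the signs.
import numpy as np
from sklearn.linear_model import Lasso

def choose_lambda(X, sigma_hat, multiplier=1.0, nsim=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    sup_norms = [np.abs(X.T @ rng.normal(scale=sigma_hat, size=n)).max()
                 for _ in range(nsim)]
    return multiplier * np.mean(sup_norms)

def lasso_selection(X, y, lam):
    n = X.shape[0]
    # scikit-learn's objective is (1/(2n))||y - Xb||^2 + alpha ||b||_1, so a
    # lambda on the (1/2)||y - Xb||^2 scale corresponds to alpha = lam / n.
    fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
    active = np.flatnonzero(fit.coef_)
    signs = np.sign(fit.coef_[active]).astype(int)
    return active, signs

# Usage, with X the 633 x 91 design and y the resistance measurements:
# resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
# sigma_hat = np.sqrt(resid @ resid / (X.shape[0] - X.shape[1]))
# lam = choose_lambda(X, sigma_hat)
# active, signs = lasso_selection(X, y, lam)
```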
In this example, the lambda I'm going to use is 43. Now, in common practice, people use cross-validation to choose lambda. If I have time, I'll describe a version of selective inference you can use with cross-validation, but likely I won't; I'm happy to go into it in the question period with whoever still wants to listen.

Okay, so for my problem I'm using lambda equal to 43, and for that all I needed was the design matrix plus some estimate of the noise variance. For that, I'm going to use the noise variance from the full least squares regression; whether that's a reasonable choice or not is debatable. There are other versions of selective inference that don't need an estimate of sigma, but I'm going to use the estimate from the full least squares model. That gives me this value of 43, and this is what the lasso tells me: it gives 16 mutations that seem to be important, 65R, which was one of them, 184V, the other one we saw, those were two of the ones we discussed, and 14 others. So now I have a potential regression model with 16 variables, and I want to report findings from this model.

First, let's take a quick look at the lasso solutions. I've plotted them on the same image: the gray is the lasso, the red is the ordinary least squares. As is well known, the lasso is a shrinkage method, and it has shrunk the coefficients down from the ordinary least squares values to be slightly smaller. Only 16 of the gray ones are non-zero, while all 91 of the red ones are non-zero.

Okay, so what did I do here? I used lambda equal to 43 to run the lasso, I got these 16 variables, and now I want to generate a report for my PI that has some valid inferential property. So what I've done is declare an algorithm: this is the algorithm I actually used to choose the model, and for various reasons I'm also going to record the signs of the non-zero coefficients; that's a computational simplification. This is the algorithm: for my X and my y and some value of lambda, I run this code, I get an active set, the 16 variables, and I get the signs. Based on that, after seeing that, I want to give you a report that has properties similar to what I'd have if I hadn't looked at the data, that has some confirmatory properties. Here I just run that algorithm, and, to confirm, it gives the same active set I showed you before.

Okay, so now I have these 16 variables. One thing I might do is report p-values, or maybe you don't like p-values, so I could report intervals. Here's a picture of some intervals. What I'm plotting is this: I've taken those 16 variables and refit the usual regression model, that is, I've used R with those 16 variables, and I've plotted the output of R's confidence intervals here. These have the same kinds of problems as the p-values I showed you before, in that they have no particular statistical guarantees. They're intervals, and what property should an interval have? It should cover something with some specified probability. Yes, yes, yeah.
With no intercept in this case, yeah. So I took those 16 columns of my design matrix, so now I have a 633 by 16 matrix, and I can run statsmodels OLS with that design matrix and y, and I can get the confidence intervals from that; that's what I'm plotting here. Well, these have no properties that I know of, that is, they have no coverage properties, just like the p-values I showed you after using the lasso have no properties that I know of. So now I'm going to show you what the proposed solution looks like. Here are some new intervals; the red ones are intervals whose origin I'll try to describe in the remaining time. Let's look qualitatively at what they look like. For this 184V variable, the red interval is almost the same as the black interval, so for this parameter it almost would have been okay to use the ordinary least squares interval; it's basically the same. Similarly for some of the other variables, like 69i here and 65R, the confidence intervals are just about the same. But for other variables, like this 75i, we went from a fairly short black interval to a rather longer red interval, and this one over here, 190a, looks quite long. Some would say these intervals are too long; maybe I'll have a chance to describe how you might fix that, maybe I won't.

Okay, so these are intervals. I said the black ones don't have any properties; what property do the red ones have? They have the property that they cover something 95% of the time. And what do they cover 95% of the time? Let's see if I can describe it. For these 16 variables there is some regression model, some noiseless regression model: if I observed my response without the Gaussian noise, that 16-variable model would have true population coefficients, and these intervals cover those population coefficients 95% of the time. So there are parameters that these things cover 95% of the time. The interpretation is that if I use this algorithm many times and construct intervals like this, then as time accrues I will accumulate many intervals; that's the frequentist interpretation, over time. I'm running many lassos on many different data sets, I get many different intervals, and for each of those intervals there's some number that it should cover. That's what a confidence interval is: a random interval, based on the data, that's supposed to cover some number, and 95% of the intervals cover what they're supposed to. So that's what I mean by "over time": doing many data analyses over a long period.

No, well, when you say estimate, there's a point estimator that I would produce. Are you familiar with the usual confidence intervals, the ones I'd produce if I had not done any selection? The usual confidence intervals are centered around the estimate, the point estimator, plus or minus two times your estimate of its variability; that's the standard confidence interval, and that's the black ones. They're centered around a point estimate, plus or minus twice the standard error.
Those intervals are random because they're based on the data you put in when you compute them, and there's some true parameter, not an estimate, that they're supposed to cover; 95% of the time, if you haven't done selection, they cover it. Now, these red intervals are not centered around a point estimate; they're based on a conditional distribution. But they have the same property: they're intervals, there's some number that's supposed to be inside, and for 95% of them that number is inside. So they have the same mathematical guarantee that estimate-plus-or-minus-twice-standard-error satisfies, but they're not that interval; they're not estimate plus or minus twice the standard error. The estimate-plus-or-minus-twice-standard-error form is a rather comforting way of thinking about confidence intervals, because you have your estimate and you have how accurate you think it is, and the interval neatly summarizes that. But the actual mathematical property that interval has is that there's something that's supposed to be inside it, and 95% of the time it is inside. These have the same property, and the thing inside is still a parameter.

So now there's the interpretation that there are many different models I could have chosen from these 91 columns. For each subset of those 91, of any size, there is some regression model I could have fit, and that regression model has some parameters. When the lasso chooses a particular subset, like these 16, there are 16 numbers the intervals try to cover; if it chooses a different one, of size 20 say, there are 20 different numbers they try to cover. So when I say an interval is covering a parameter, that's if I fix the model that was chosen. If I don't fix the model that was chosen, then as I get different response vectors I'm going to have a different model coming out, and in that case this would be more like what statisticians usually call a prediction interval: the thing in the middle that we're covering is itself random, because it depends on which y I got. So, statistically, when I think of them as covering a parameter, I'm conditioning on this being the model I chose; if I think of them as covering a random variable, I'm marginalizing over the choice of model and I have an unconditional statement.

You pay a price in width? Yes, the confidence intervals are wider, though not every one is wider. These ones, you could think of them as the parameters that were easy to find in the model and easy to estimate; but there are some that are not particularly strong, that the lasso has thrown into the bucket anyway, and you have quite high uncertainty about those. That's why those intervals are wide. Make sense?
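For comparison, here is a minimal sketch of how the naive "black" intervals from a few paragraphs back could be produced: refit ordinary least squares on the selected columns with statsmodels, no intercept, and read off its confidence intervals. The variables `X`, `y`, and `active` are assumed to come from the selection step sketched earlier; these intervals carry no post-selection guarantee, which is exactly the point.

```python
# Naive ("black") intervals: OLS refit on the selected columns, standard
# confidence intervals, ignoring the fact that the columns were selected.
import numpy as np
import statsmodels.api as sm

def naive_intervals(X, y, active, level=0.95):
    ols = sm.OLS(y, X[:, active]).fit()
    ci = ols.conf_int(alpha=1 - level)     # one (lower, upper) row per selected variable
    return dict(zip(active, map(tuple, np.asarray(ci))))

# for var, (lo, hi) in naive_intervals(X, y, active).items():
#     print(var, lo, hi)
```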
Okay, and as I said, some of these intervals are rather long. It turns out that if you do some randomization, they can be made shorter; that material is briefly touched on in these slides, but I probably won't get there. Yeah, so getting more data actually may not fix it, and the bootstrap may not fix it, but randomization can fix it. Bootstrapping isn't really making new data; it maybe gives you a more accurate distribution to sample from, but it doesn't fabricate new data. The intervals that are really long are the ones where the true effect is zero. What happens if you collect more data is that if you have a small effect, then when you standardize everything, the actual effect in the z-score gets multiplied by the square root of n, where n is the sample size; but square root of n times zero is still zero. So for the ones that are really null, this simple version of "get more data" won't necessarily fix them, but randomization provably can make them shorter.

Okay, so this is the typical task we're facing: we have some data with no model, we have some algorithm to generate questions out of some big collection, and we're going to test whichever of those questions the algorithm suggests. If you want a formal description of the mathematical properties of what I'm describing, you might look at these papers. I forgot to mention, I put some pictures of students down here, students I worked with at Stanford, and three or four of them are actually around campus here this quarter. You guys sitting in the back worked on the lasso with me, if anyone wants to find them, and Will here is an assistant professor in the stats department now; that's why I put their pictures here, so if you see them around campus and you're interested, you can talk to them.

Okay, so there's a formal mathematical description of this in this paper if you're interested. Let me try to describe, without all the guts, what's happening here. As we were just discussing, I'm in the regression context: I have a response y and a design matrix X, and the regression model is that, once you know the design matrix, y given X is normal with some mean vector mu and some covariance matrix; earlier I said this was sigma squared times the identity. That's how statisticians write this model: it's a distribution for y given X, and it depends on the parameter mu. Often we think of the variance as fixed and known; we don't have to, but for what I'm talking about today I'll assume that. So what is the statistical model? It's a collection of distributions for the data y, in this case given X, parameterized by this mean vector mu. Now, as we said before, think of E as a subset of the 91 features. For any such subset there is a regression model that I could have fit, with lm in R say, and that regression model has corresponding population parameters, one for each element of E. Those are the parameters my intervals are supposed to cover.
Once I know the mean vector mu, I can write down what the regression coefficient of variable j in the model with features E is, and I can ask whether it's zero or not. So in regression, the collection of possible questions to ask is: choose some subset E of the features and then, for each element of E, ask whether its regression coefficient is zero or not; that would be the test. If I'm producing an interval for each element of E, form an interval that covers beta_{j,E} and report it. When we do this for different y's we'll get different E's, so we'll get different reports out of the software.

Okay, so what do we actually do? This one line summarizes a lot of, basically, this whole paper on the lasso, well, not the whole paper. We take a conditional approach. As I said before, when we were talking about intervals covering something, if I fix the model that was chosen, then there are fixed numbers my intervals are trying to cover, and fixed parameters I'm asking whether they're zero or not. So we condition on the selection event: E hat here is the set of variables the lasso selects, z hat of E is my notation for the signs of the variables the lasso selects, and we condition on the event that they take a particular value. For any particular data set we have a realized version of E hat: when I run this function, for this y and this X it gives me those 16 variables and 16 signs. So there is some subset of size 16 of the 91 features and some 16 signs, and the conditional distribution given that event is a distribution I can use for the data. It's actually a model, because it's indexed by the mean vector mu, and I just use this distribution for inference. I'm not going to give you the gory details, and they're actually not too gory in this case; I'll give you a picture to describe what goes on. Earlier, in the regression setting, I had a parametric statistical model indexed by the mean vector; this conditional model is also a parametric statistical model indexed by the mean vector, and there are lots of tools in statistics for doing inference in parametric models. I'm just using those here. Yes, yes, yes, it's not Gaussian, even if you started with a Gaussian. And when we run this in practice, we plug in the realized values and report intervals based on those.

Okay, so let me try to describe what happens for the lasso. This picture, which is getting close to the graphic that Ali sent around and that I was going to explain a little in the talk, is an example of what happens when you fit the lasso in a concrete case. Here n equals two, so I only have two responses, and I have three possible features. What does the lasso do? It finds a subset of variables that it declares to be non-zero, and I've also asked it to give me the signs of those non-zero variables. When I say the lasso gives me a subset of variables, that means there are different sets, there's a partition of the sample space; n equals two, so it's a partition of the plane, and in different regions my function gives me different values, different variables and different signs. This is what the partition looks like.
It's a partition of the plane into polyhedral regions, and we can describe these polyhedral regions explicitly given the design matrix X. So now suppose I ran the lasso on this y with this X matrix; the columns are two-dimensional vectors, so three of them give me a 2 by 3 matrix. The value of lambda, I should say, determines the scale of this box in the middle. Suppose I ran it on this y: y falls in this particular region here, and the lasso tells me that variable one and variable three are non-zero, with positive signs. What I'm arguing for is taking the conditional distribution of y given that it lies in this region, and then just doing parametric inference with that. It turns out that for the lasso you can reduce the inference problem to something even simpler: you just have to work out a univariate distribution. If I want to test, I think this is testing whether beta three is equal to zero, you restrict the distribution to this one-dimensional line segment here, and if you read the papers you can find a formal description of that; but in two and three dimensions this is exactly what I've drawn there.

What makes it work is that this partition I'm talking about, taking my R^633 and breaking it into different pieces, is something I can describe. It turns out that for the lasso at a fixed value of lambda, you can describe the partition in terms of affine inequalities, so these are affine, polyhedral sets, and you can write them down explicitly given the design matrix X. Here X_E should be thought of as the E columns of the matrix X, and the y's that would give you the same variables and the same signs are all the y's satisfying these inequalities. That's actually enough to do all the computations I've shown you so far.

Okay, so how am I doing on time? Yeah, okay. Maybe I'll wrap up with just a report, to show you what the p-values look like. I showed you intervals before, but you can also construct hypothesis tests of whether a parameter is zero or not in this conditional distribution. It's a parametric distribution, in fact an exponential family, so there are lots of tools to construct tests. What should a test satisfy? When you use this conditional distribution, the test should have type one error of alpha or less. And so for each particular variable in the regression model, each of those 16, I can ask whether that regression coefficient is zero or not, and report a p-value if you like.
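To make the affine-inequality description and the resulting p-value concrete, here is a sketch following my reading of the polyhedral approach of Lee, Sun, Sun and Taylor: build the inequalities A y <= b that say "the lasso at this lambda selected exactly this active set with these signs", then test one selected coefficient using a Gaussian truncated to the interval those inequalities induce along one direction. The function names, the known-sigma assumption, and the no-intercept setup are assumptions of this sketch, not a drop-in for the software used in the talk.

```python
# Sketch of conditional (selective) inference for the lasso at fixed lambda.
import numpy as np
from scipy.stats import norm

def selection_polyhedron(X, active, signs, lam):
    """Affine inequalities A @ y <= b describing {E_hat = active, signs_hat = signs}."""
    XE = X[:, active]
    inactive = np.setdiff1d(np.arange(X.shape[1]), active)
    XI = X[:, inactive]
    XEinv = np.linalg.inv(XE.T @ XE)
    PE = XE @ XEinv @ XE.T                      # projection onto the selected columns
    D = np.diag(signs)
    # selected coefficients keep their signs
    A_act = -D @ XEinv @ XE.T
    b_act = -lam * D @ XEinv @ signs
    # unselected variables stay inside the lasso's dual feasibility bound
    U = XI.T @ (np.eye(X.shape[0]) - PE)
    c = XI.T @ XE @ XEinv @ signs
    A_inact = np.vstack([U, -U])
    b_inact = np.concatenate([lam * (1 - c), lam * (1 + c)])
    return np.vstack([A_act, A_inact]), np.concatenate([b_act, b_inact])

def selective_pvalue(X, y, active, signs, lam, j, sigma):
    """Two-sided p-value for the j-th selected coefficient in the selected model."""
    A, b = selection_polyhedron(X, active, signs, lam)
    XE = X[:, active]
    eta = XE @ np.linalg.inv(XE.T @ XE)[:, j]   # eta @ y is the OLS coefficient of variable j
    t = eta @ y
    sd = sigma * np.sqrt(eta @ eta)
    c = eta / (eta @ eta)                       # direction along eta
    r = y - c * t                               # part of y independent of eta @ y
    Ac, Ar = A @ c, A @ r
    with np.errstate(divide="ignore"):
        bounds = (b - Ar) / Ac
    lower = bounds[Ac < 0].max() if np.any(Ac < 0) else -np.inf
    upper = bounds[Ac > 0].min() if np.any(Ac > 0) else np.inf
    # eta @ y, given selection, is Gaussian truncated to [lower, upper];
    # compute the two-sided p-value under beta_{j,E} = 0.
    Fl, Fu, Ft = norm.cdf(lower / sd), norm.cdf(upper / sd), norm.cdf(t / sd)
    F = (Ft - Fl) / (Fu - Fl)
    return 2 * min(F, 1 - F)
```

The univariate truncation in `selective_pvalue` is the "one-dimensional line segment" mentioned above: conditioning on the polyhedron and on the part of y orthogonal to eta leaves a single truncated Gaussian coordinate to work with.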
Okay, and, it cut some of them off, but I've made a report here of the naive p-values, which is what we saw earlier in my simulations, where I just used the standard z-statistic and the standard conversion to a p-value; that's what I call the naive OLS p-value, alongside the selective p-value. For this particular instance, comparing the two, let's see which ones would be declared significant. If you used the naive ones, it would be all the red ones; I haven't counted how many there are, it looks like about 10. But there are some differences. The lasso chose variable 115f, and if I had not adjusted for selection, the p-value here would be 0.01, less than 5%, so I would have declared it significantly non-zero. But using the corrected test, it's no longer significant even at the 20% level. And I think I will wrap up there. There's more you can look at in the notebook, and I'm around for the quarter if anyone wants to talk about it later. So I'll stop there. Yes, are there questions?