Okay. So we're live now. Good morning, everybody. I'm excited to introduce our first keynote speaker of the R Medicine Conference. Daniela Witten is a professor of statistics and biostatistics at the University of Washington and is the Dorothy Gilford Endowed Chair of Mathematical Statistics. Her research involves the development of statistical machine learning methods for high-dimensional data, with applications to genomics, neuroscience, and other fields, and she's particularly interested in unsupervised learning with a focus on graphical modeling. Daniela has been the recipient of a number of honors, including, among others, an NIH Director's Early Independence Award, a Sloan Research Fellowship, and an NSF CAREER Award. In 2020 so far, she's received the Breiman Junior Award and was elected a fellow of the American Statistical Association. She's also been named to the Forbes 30 Under 30 list in the Science and Health Care categories three times. Daniela may be best known to many of us as a co-author, with Gareth James, Trevor Hastie, and Rob Tibshirani, of the popular textbook An Introduction to Statistical Learning. But her work has also been featured in popular media such as Forbes and Elle, on Seattle's local NPR station, in a NOVA show, and as a PopTech Science Fellow. Today she will be talking about beyond sample splitting: valid inference while double dipping.

Okay, thanks so much for the very kind intro, Beth, and to the organizers for putting together this conference, and to everyone taking time out of their very busy lives at this really chaotic moment to attend my talk. So today, as Beth mentioned, I'm going to be talking about going beyond sample splitting. And the idea here (sorry, give me one second to figure out how to share this), the idea here is that often in statistics there's a gap between what we want to do when we analyze our data and what we know we should be doing. In particular, I'm going to refer to that gap today as double dipping. The idea is that, in theory, we know what we're supposed to be doing. So here's a little cartoon of me being very virtuous and going into my data analysis with a very clearly pre-specified plan. In particular, before I see my data, I know exactly what hypothesis I want to test, and I'm only ever going to test that hypothesis. And in theory, this is how all data analyses should go. This is what we learned in our intro stats class: we've got to go in there knowing what we're going to do, and we've got to do just that. And if we behave in this very virtuous way, then we're really not going to run into any trouble, and everything is going to work out in terms of the statistical properties that we expect to see for the methods that we're using. But life's not perfect, and I, at least, am not that virtuous. So the truth is that in practice, my data analyses, and I bet many of yours, go a bit differently from what I showed on the previous slide. In practice, I visualize my data, I perform clustering, I perform dimension reduction, I might even remove a few outliers. I explore my data in all sorts of different ways. And then, realistically, I decide what hypotheses I want to test. And the problem is that the typical framework that we use for thinking about our data analysis and for hypothesis testing does not allow us to do this. This practice of how data analysis is actually performed is really not allowable. So why is there this gap between theory and practice?
Well, it has to do with the way that we've been generating data, increasingly, in the last couple of decades. The point is that there's been a real shift in the way that we collect data, and in the reasons we collect data, away from hypothesis testing and towards hypothesis generation. It used to be that researchers would collect a data set and they'd say, okay, I'm going to measure these four proteins in mouse tumors, and I just want to know: do these protein levels go up or do they go down over the course of tumor progression in the mice? And if you're collecting data to answer a very specific question like that, then yes, you can specify your hypotheses in advance, and it's easy to be virtuous in your data analysis. But that's really not how we're often collecting data today. Nowadays, we're collecting larger and larger data sets, and we typically don't have a specific hypothesis that we want to test. We don't have a specific question in mind. Instead, we're increasingly collecting data with an eye towards hypothesis generation. The point is, we collect data and we just want to find something interesting in the data. And then we also want to double-check that what we found, and thought was interesting, really is interesting. So in particular, often we're generating a hypothesis based on a data set, and then we're testing that hypothesis based on that same data set. In this talk, I'm going to refer to this practice as double dipping. I think a lot of you have intuition for the fact that even though this is something that we do all the time, it is very deeply problematic from a statistical perspective. We can really run into a lot of trouble if we do this double dipping and we're not careful about how to account for it in our analysis.

So I'm going to give you a few examples of double dipping, three examples in particular that I've been studying in my research group. The first example of double dipping has to do with clustering. Here I have 100 observations on two features. If you want, you can think about this as data with just two variables that we've measured, or you can think about this as data where I've projected onto the first two principal components. The idea of clustering is that I take these observations, I perform, for example, hierarchical clustering to find three clusters, and then I can label the observations according to these clusters that I've estimated. Okay, so it turns out that clustering, which is such a fundamental tool across so many areas of research and data analysis, really naturally leads to a big double dipping problem. And that's going to be the focus of my talk today.

So what goes wrong? Well, let's say that I sample 100 observations from a noise model. These observations are shown in black, and there's no signal here. There are no true clusters. These observations are just generated as Normal(0, 1). But I can cluster the observations just to see. And I will get clusters, because if I cluster data and I ask for, say, three clusters using hierarchical clustering, I will get three clusters. If you go looking, you will find them; you can always find clusters, even if they're not real. And then I can ask, well, are those clusters that I got really real? Is there a difference in means between the estimated clusters?
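(To make this concrete, here is a minimal R sketch of the kind of null simulation being described. The choices of complete linkage, three clusters, and a known-variance Wald-style test on both features are my own illustrative assumptions, not necessarily the exact settings used on the slides.)

```r
set.seed(1)
n <- 100
X <- matrix(rnorm(n * 2), ncol = 2)          # pure noise: no true clusters

hc <- hclust(dist(X), method = "complete")    # hierarchical clustering
cl <- cutree(hc, k = 3)                       # ask for three clusters; we will get them

# Naive test for a difference in means between clusters 1 and 2, treating
# sigma^2 = 1 as known and ignoring that the clusters came from this same data.
m1 <- colMeans(X[cl == 1, ]); m2 <- colMeans(X[cl == 2, ])
stat <- sum((m1 - m2)^2) / (1 / sum(cl == 1) + 1 / sum(cl == 2))
pchisq(stat, df = 2, lower.tail = FALSE)      # typically a very small p-value
```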
And what happens is actually a little bit alarming. I can ask: what's the mean of the blue cluster? What's the mean of the green cluster? What's the mean of the orange cluster? I can measure how far apart they are, and I can compute a p-value using, for example, a z-statistic or a t-test. And what I get are really, really tiny p-values. All three of those p-values are very small. So this is really bad. Because, again, if you remember how I generated the data, there was no signal here. There were no true clusters. But when I did clustering and then tested for a difference in means, I rejected the null hypothesis of no difference in means between the clusters. So I would erroneously conclude that these clusters are real.

Okay, so you're probably listening to this talk and thinking, okay, but that was a really naive analysis. Obviously the problem here is that I didn't have a test set. If I had just had a test set, then we could have avoided this situation. So let's see if that's really true; let's see if sample splitting can get us out of this hole. On the left, I once again have this so-called null data set where there are no true clusters, but now half of the observations are displayed as squares and the other half as triangles. The squares are my training set, the triangles are my test set. So that's step one: we've divided the data into a training set and a test set. Now what we're going to do is cluster only the training set. And here in particular, I'm clustering to get two clusters. So I've clustered the training set, and now I'm going to label the test observations using the clusters obtained on the training set. In this example, I used three-nearest-neighbors classification to label the test observations according to their training set clusters, but the details aren't really that important. The point is that I got these clusters on the training set, so the snooping that I did happened only on the training set, and I'm just applying those clusters to the test set.

All right, so now we can test for a difference in means between those clusters on the test set. But look what happened. The orange cluster mean is still far from the green cluster mean, even on the test set, and once again I get a really, really tiny p-value. So I was not able to solve this double dipping problem by way of sample splitting. And that's kind of a crazy situation, because we always think that sample splitting is going to solve our problems, but it doesn't. Here's a very concrete example where sample splitting did not help us. This double dipping problem is somehow fundamental, in a way that we weren't able to get around by sample splitting. So, just to be totally concrete, sample splitting has an inflated type I error rate. In other words, if I said that I only wanted to reject the null hypothesis for p-values less than 0.05, I would accidentally reject the null hypothesis almost always, even on examples like this one where there's no signal in the data.

Okay, so it turns out this double dipping issue is not limited to clustering. We can also see an example of it in decision trees. So what's a decision tree? Well, here are some observations again, shown in two dimensions. The axes here, you can think of them as x1 and x2, for two features.
Or you can think about this as the projection of data onto a couple of principal components; either way. And I hope you're able to see on the screen that each observation is colored according to the value of the response y: a lighter blue indicates a large value of the response, and a darker blue indicates a very negative value of the response. And this is just noise. There's no signal in this data; there's no relationship between the response and the features. Okay, so we're going to fit a decision tree, just using CART. This is just a very standard regression tree. And what we find is that, okay, it turns out that if x1 is less than negative 9.2, we should predict a response of negative 0.92, and if x1 is greater than negative 9.2, we should predict a response of 0.12. So even though there's no signal in the data, we can still fit a decision tree, and this is what we get.

So let's try to understand why double dipping happens here. To make it really concrete: we're sampling 100 observations with just noise, no signal, and fitting a decision tree. It's the same decision tree I showed you before, and now I'm also showing you in red the partition of feature space corresponding to that decision tree. And now we're going to test for a difference in means between the two terminal nodes in that decision tree. In particular, the mean to the left of the red line is negative 0.92, and the mean to the right of the red line is 0.12. Those are the sample means, and we want to know: is there actually a true difference between the population means to the left and right of the red line? And if we do this, we get a p-value of 0.002. So again, what happened here? There was literally no signal in our data, none at all. I just generated this data as noise. But if we fit a decision tree with two terminal nodes (in other words, we just look for a good way to split the data, either along the x1 axis or the x2 axis, and we're allowed to decide where to put that split in order to make the difference in means look as big as possible), and then we test for that difference in means without adjusting for the fact that we chose the decision tree in order to make that difference in means look big, then we're going to erroneously reject the null hypothesis.

And finally, my third example of double dipping today has to do with change point detection. The idea of change point detection is that you have some observations over time. Here you can think about the x-axis as time; it doesn't need to be time, just any kind of sequential ordering, but thinking about the x-axis here as 100 time points can help make it concrete. And the idea in change point detection is that the mean is constant over time, but with occasional change points, and you're trying to find out where those changes happen. So here on the right-hand side, in purple, I'm showing you that the mean is constant until around time point 45, and then it jumps up. Okay, so how does double dipping happen? Well, again, I'm going to sample 100 observations. These are just Normal(0, 1) observations; there's literally no signal here. But I can always detect change points. I mean, I can always estimate a change in mean, and that's what I've done here using some change point detection algorithm. And then I can just compute a p-value for whether there's a difference in means to the left and right of the estimated change point.
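(Here is a rough R sketch of that change point version of double dipping. I'm using a simple stand-in estimator, namely putting the change point wherever the difference in means looks biggest, followed by a naive t-test, and repeating over many null data sets to estimate the type I error. The specific estimator and the exact numbers are illustrative assumptions, not the talk's.)

```r
set.seed(1)
naive_cp_pvalue <- function(y) {
  n <- length(y)
  # place the change point where the gap in sample means is largest
  gaps <- sapply(2:(n - 1), function(t) abs(mean(y[1:t]) - mean(y[(t + 1):n])))
  tau <- which.max(gaps) + 1
  t.test(y[1:tau], y[(tau + 1):n])$p.value    # naive test, no adjustment for the search
}

pvals <- replicate(2000, naive_cp_pvalue(rnorm(100)))  # no true change points
mean(pvals < 0.05)   # empirical type I error: far above the nominal 0.05 level
```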
And if I do that in this example, my p-value is 0.03. And again, why is the p-value small even though there's literally no signal in my data? Well, the reason is that I got to decide where to put that change point. I literally put the change point in the spot that made the estimated difference in means look as big as possible. And so, given that I got to choose the change point to make the estimated difference in means as big as possible, we should not be surprised that I reject the null hypothesis of no change in means.

So far I've just been showing you a couple of examples of p-values on a single draw of simulated data. You might wonder, well, did I just pick out the worst example, or is this really a systematic problem? And the answer is yes, this is systematic. Here's a QQ plot, and if you're used to looking at QQ plots, this one probably looks pretty strange. On the x-axis, I'm showing you the quantiles of a Uniform(0, 1) random variable. And on the y-axis are the quantiles of a p-value that did not adjust for double dipping. In other words, this is just a standard Wald test for a difference in means, without considering why I'm testing for that difference, without considering the fact that double dipping was done. And this is data where there's actually no true change in means, no actual signal. So if all were well in the world, then my p-values would fall exactly on that 45-degree line shown in red. But in fact, my p-values are that black droopy curve along the bottom of the plot. And what that means is that my p-values are tiny instead of being uniform. In other words, it's not like double dipping is sometimes okay and sometimes not: if you do double dipping, your p-values are just going to be completely off. So the p-values aren't uniform, and the type I error rate is 97%. What I mean by that is that in cases where there's no signal in the data, where I've just generated noise, instead of rejecting the null hypothesis at level 0.05 about 5% of the time, which is what we'd want to see, we actually reject the null hypothesis 97% of the time. In other words, 97% of the time we're going to be fooled by our data if we do this double dipping.

And I just want to emphasize: on the one hand, it might seem obvious that we shouldn't be doing this double dipping. But on the other hand, this is what we need to do with our data. We need a way to do some kind of exploratory data analysis and then test the significance of the results that we get. So it's clear that naive double dipping isn't the right thing to do. But the question is, okay, if that isn't right, what should we be doing? And we've already seen that sample splitting isn't necessarily a good fix. For example, in the context of clustering, sample splitting does not solve any of our problems.

So what's wrong with double dipping? Well, let's try to flesh this out a little bit more. When we're testing a null hypothesis, the null hypothesis that we're testing can't be a function of the data. The null hypothesis should have to do with the parameters, but it shouldn't have to do with the data itself. And the problem is that when we double dip, we're actually choosing the null hypothesis that we want to test after looking at the data.
For example, in the context of clustering, we're deciding to test whether the mean of the observations in cluster one equals the mean of the observations in cluster two after looking at the data. And that is not what we're supposed to do. Null hypotheses need to be pre-specified in advance; they shouldn't be functions of the data. And the consequence is that we're going to reject the null hypothesis too often. So in this talk, I'm going to talk about a fix for double dipping in the context of hierarchical clustering, but similar ideas apply to change point detection and also to decision trees. My research group has been working on double dipping for all three of these methods, and the general big-picture framework is the same; I'm just going to focus on hierarchical clustering today.

Okay, so we're going to try to fix the issues associated with double dipping for hierarchical clustering, because we want to be able to cluster our observations and then test the null hypothesis that the mean in the orange group equals the mean in the green group. In other words, we want to be able to ask whether the estimated clusters are really different, and we want to do this in a way that accounts for the fact that the clusters were estimated from this same data.

So, to phrase this a little more statistically, we need a model for our data. The idea is that we have an n by q data set, and the i-th row, Xi, is drawn independently from a normal distribution with its own mean vector and a known variance. (I apologize that there's a typo on the second line of the slide: it says Xi is distributed normal with mean zero and variance sigma squared I, but what I meant is that Xi is distributed i.i.d. normal with mean mu i and variance sigma squared Iq. So each observation is allowed to have its own mean mu i, with a shared known variance sigma squared.) The idea is that we're going to cluster the rows of this data set to get cluster labels c hat 1 through c hat K, and then we're going to let mu bar of c hat k be the mean associated with the k-th cluster. So mu bar of c hat k is the population mean associated with the observations that were assigned to the k-th cluster by my clustering procedure.

The question we're asking is: is it possible to test the null hypothesis that the mean in the k-th cluster equals the mean in the k-prime-th cluster, in a way that accounts for the fact that these clusters were estimated from the data? The standard way to test this null hypothesis, without accounting for the fact that we're double dipping, is to use a Wald test. And the Wald test is pretty straightforward and has a p-value that makes a lot of sense: it's just the probability, under the null hypothesis, that we would observe at least as big a difference in means as what we actually see. So first, just to unpack what's in this p-value, look at the denominators on the left and the right of the inequality. The denominators are the same, so you can kind of ignore them. They're not very important; they just come in handy a little later.
So if we just look at the numerator, the right-hand side is the squared distance between the estimated mean in the k-th cluster and the estimated mean in the k-prime-th cluster. What we're asking is: what's the chance of seeing such a big estimated difference in means between the k-th and k-prime-th clusters, under the null hypothesis that there's really no difference between the means? And this is exactly what a p-value usually does, right? It's saying: what's the probability, under the null hypothesis, of seeing such a big difference? And now we'll see why the denominator was useful. The denominator was useful because Xi is drawn from a normal distribution with mean mu i and variance sigma squared I, so the entire term on the left-hand side is actually a chi-squared random variable with q degrees of freedom. So what we're asking is: what's the probability that a chi-squared random variable with q degrees of freedom is greater than or equal to the right-hand side? And that's something we know how to calculate, just using the quantiles of a chi-squared distribution.

So this is the standard Wald test. But as we saw, it is not going to control type I error properly; we're going to reject the null hypothesis way more often than we should. And the reason, again, is that those clusters were estimated from the data. Since c hat k and c hat k prime were obtained by clustering X, we picked the null hypothesis by looking at the data, and then we used the same data to test that null hypothesis. So this is a problem: the Wald test doesn't adjust for double dipping.

Okay, so how are we going to fix this? Well, basically, we're going to fix the problem by conditioning on the way that we selected the null hypothesis. In particular, what if, instead of defining the p-value to be the probability that we observe such a large difference in means between the estimated clusters, we define the p-value to be the probability that we observe such a big difference in means between the estimated clusters (that's the left-hand side), but conditional on something? And in particular, let's condition on the event that our clustering actually gives us the estimated clusters c hat k and c hat k prime. So this p-value says: out of all data sets that result in the clusters c hat k and c hat k prime, what's the probability, assuming no true difference in the means, that we see such a large difference between the sample means of c hat k and c hat k prime? The key distinction between this and the Wald test is that wording: out of all data sets that result in the clusters c hat k and c hat k prime. And that wording reflects the fact that the only reason we're even testing this null hypothesis in the first place is that our data gave us the clusters c hat k and c hat k prime. If our data hadn't given us those clusters, we never would have even asked about this null hypothesis. So the idea behind this p-value is that it's going to control something called the selective type I error rate, which is the error rate given that the null hypothesis is true, but also given that we decided to test that null hypothesis. Because there's no point in asking what's the chance that a data set would have shown such a big difference in means between c hat k and c hat k prime, if that data set wouldn't even have led us to test for that difference in means.
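(To make the contrast concrete, here is one way to write down the two p-values just described. The notation is my paraphrase of what is said verbally, in particular the exact normalization in the Wald statistic, so it should be checked against the slides and the paper rather than treated as a quotation.)

```latex
% Naive Wald p-value: ignores how \hat{C}_k and \hat{C}_{k'} were chosen
p_{\mathrm{Wald}}
  = \mathbb{P}_{H_0}\!\left(
      \chi^2_q \;\ge\;
      \frac{\lVert \bar{x}_{\hat{C}_k} - \bar{x}_{\hat{C}_{k'}} \rVert_2^2}
           {\sigma^2\left(1/|\hat{C}_k| + 1/|\hat{C}_{k'}|\right)}
    \right)

% Selective p-value: condition on the event that clustering X actually
% produced the clusters \hat{C}_k and \hat{C}_{k'} in the first place
p_{\mathrm{selective}}
  = \mathbb{P}_{H_0}\!\left(
      \lVert \bar{X}_{\hat{C}_k} - \bar{X}_{\hat{C}_{k'}} \rVert_2
      \;\ge\;
      \lVert \bar{x}_{\hat{C}_k} - \bar{x}_{\hat{C}_{k'}} \rVert_2
      \;\middle|\;
      \hat{C}_k,\, \hat{C}_{k'} \in \mathcal{C}(X)
    \right)
```

The second expression is the "out of all data sets that result in c hat k and c hat k prime" p-value; as discussed next, the talk goes on to condition on a bit more than this in order to make it computable.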
So the selective type I error rate is the error rate given not only that the null is true, but also that we decided to test this null hypothesis. And the reason we decided to test this null hypothesis is that the clustering gave us those clusters. That's why this new p-value that we're interested in has this extra conditioning event in it.

Okay, so this sounds like a good idea, and if we could compute this p-value, we would be good to go. We would have fixed our double dipping problem, because this p-value accounts for double dipping. But the problem is that this p-value is very hard to compute. The reason it's hard to compute, basically, is that here we're conditioning on the universe of data sets that give us c hat k and c hat k prime, and I need some way to analytically characterize that universe of data sets so that I can actually do this calculation, actually get a number for this p-value. And this conditioning set is just pretty hard to work with; it's unclear exactly how to grapple with it. Part of the issue is that there are a lot of nuisance parameters, because I'm conditioning on the fact that clustering X gives me c hat k and c hat k prime, but I haven't conditioned on anything about what the other clusters would be. So it's a little hard to figure out even where to begin with this p-value. It's a good idea in principle, but maybe not that practical.

So the idea is that instead of using that p-value, we're going to condition on a little bit more, what I'll call other stuff. When we add in this other stuff, we will still control the selective type I error, but we'll also make our computations easier. And I should mention that this is very much related to a line of work by Will Fithian, Jonathan Taylor, Jason Lee, and others from a few years ago in the context of regression.

Okay, so what's the other stuff we're going to condition on? Well, we define the p-value once again to be the probability, under the null hypothesis, of seeing such a big difference in means, given that our clustering resulted in c hat k and c hat k prime, but we also condition on a whole bunch of other things. If just looking at this gives you a headache, please bear with me, because it turns out that even though it looks terrible, it has a fairly simple and intuitive interpretation. We're going to understand in pictures what this additional conditioning is doing.

So we're going to need a mathematical result, and this result will make it clear why we're conditioning on all this extra information. What this result tells me is that the p-value I'm interested in, which, once again, is the probability under the null of observing such a big difference in means between c hat k and c hat k prime, given that clustering the data gave me c hat k and c hat k prime, and also conditional on some extra stuff, is actually equal to the probability, under the null, that a chi-squared-distributed random variable is at least as big as the difference in means that we observed, given that clustering x prime of phi results in c hat k and c hat k prime, where x prime of phi is an n by q data set with rows defined as follows. So x prime of phi is actually a perturbation of our original data.
The i-th row of x prime of phi is equal to the i-th row of x if the i-th observation is not in cluster k and not in cluster k prime. But if the i-th observation is in cluster k or in cluster k prime, then x i prime of phi perturbs that i-th observation a little bit, in a very specific way that I'm going to talk about in a couple of minutes. So really, what's the point here? Well, the point is that the p-value I care about is just the probability that this random variable phi is at least as big as the difference in means I observe between the k-th and k-prime-th clusters, given that clustering x prime of phi results in those same clusters, where x prime of phi is just the original data with some of the observations perturbed by a function of phi. I'm going to talk more about what this perturbation looks like in a couple of minutes, but in the meantime, please bear with me, and let's just talk about the interpretation of this p-value. This p-value is saying: out of all of the data sets of the form x prime of phi that result in c hat k and c hat k prime, what's the probability, assuming no true difference in the means, that we would observe such a large difference in the sample means of c hat k and c hat k prime? Remember, earlier I told you that the p-value we wanted to compute was: out of all data sets giving me c hat k and c hat k prime, what's the probability of seeing such a large difference in the sample means? Well, now, instead of out of all data sets, I'm saying out of all data sets of this form x prime of phi that result in c hat k and c hat k prime. So I'm not going to consider every data set in the universe that gives c hat k and c hat k prime; I'm just going to consider data sets of this particular form given by x prime of phi. And in a minute, we'll see that this particular form of x prime of phi actually makes sense.

So now let's talk about what x prime of phi means, on the next slide, through a series of pictures. Again, the definition is that x i prime of phi is just equal to x i if the i-th observation isn't in either the k-th or the k-prime-th cluster, and if the observation is in the k-th or the k-prime-th cluster, it's just a perturbation of x i by a function of phi. So here's the original data; in a little example, here's x, and notice that I've clustered x and got three clusters: blue, orange, and purple. I've called the blue cluster c hat k and the orange cluster c hat k prime. And this dashed line here is the line that passes through the means of c hat k and c hat k prime. On the right, what I'm showing you is x prime of phi where phi is positive, and in particular phi is larger than the observed difference in means between c hat k and c hat k prime. If you compare the left and the right, what you see is that x prime of phi is exactly the same as x, except that we've pulled apart the k-th and k-prime-th clusters. The observations that are not in the k-th and k-prime-th clusters are in exactly the same place (notice the purple observations haven't moved), but the blue and orange observations have moved away from each other. So what x prime of phi does is pull the blue and orange observations away from each other if phi is greater than the observed difference in means. But if phi is less than the observed difference in means, then it pushes the blue and orange observations towards each other.
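(One explicit perturbation with exactly these properties, namely that rows outside the two clusters are untouched, rows inside them are shifted along the direction of the observed mean difference, and x prime of phi equals x when phi equals the observed difference in means, can be written as follows. This is my reconstruction of the construction described verbally, so the exact scaling should be checked against the paper.)

```latex
% Let d = \lVert \bar{x}_{\hat{C}_k} - \bar{x}_{\hat{C}_{k'}} \rVert_2 be the
% observed difference in means, and let
% u = (\bar{x}_{\hat{C}_k} - \bar{x}_{\hat{C}_{k'}})/d be its direction.
x_i'(\phi) \;=\;
\begin{cases}
  x_i, & i \notin \hat{C}_k \cup \hat{C}_{k'},\\[4pt]
  x_i \;+\;
  \dfrac{\phi - d}{1/|\hat{C}_k| + 1/|\hat{C}_{k'}|}
  \left(\dfrac{\mathbf{1}\{i \in \hat{C}_k\}}{|\hat{C}_k|}
      - \dfrac{\mathbf{1}\{i \in \hat{C}_{k'}\}}{|\hat{C}_{k'}|}\right) u,
  & i \in \hat{C}_k \cup \hat{C}_{k'}.
\end{cases}
```

With this form, the difference between the two cluster means of x prime of phi has length exactly phi, which is why phi plays the role of the "difference in means" in the p-value above.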
So once again, the purple cluster hasn't moved, but the blue and orange clusters have moved towards each other in x prime of phi. And then here, on the right, is x prime of phi when phi is exactly equal to the observed difference in means, and what you see is that in this case x prime of phi is equal to x. So, just to state it one more time: x prime of phi takes x and then either pushes the k-th and k-prime-th clusters towards each other or pulls them apart, depending on whether phi is less than or greater than the observed difference in means between the k-th and k-prime-th clusters. So x prime of phi is doing something really sensible. It keeps any part of the data that doesn't have to do with the clusters we care about exactly the same, and it either pushes together or pulls apart the clusters that we do care about.

And again, what is our p-value? Our p-value is the probability, under the null hypothesis, that phi is greater than or equal to the observed difference in means, given that clustering x prime of phi results in those exact clusters that we're interested in. In other words, we think about the universe of data sets that look just like our data, but with the clusters of interest either pushed together or pulled apart; we limit ourselves to the subset of that universe that would have given us those exact clusters; and then we ask, in that universe of data sets, what's the probability that the difference in means was as big as what we saw? And that will tell us whether or not we should feel surprised about the difference in means that we did see.

Okay. All right, so now we just need to compute this p-value. The p-value, again, is the probability under the null hypothesis that phi is greater than or equal to the observed difference in means, given that clustering x prime of phi results in the two clusters that we saw. And this is actually a very simple quantity: it involves the cumulative distribution function of a chi-squared random variable with q degrees of freedom, truncated to a particular set. In particular, we need to truncate this chi-squared random variable to the set of phi such that clustering x prime of phi results in the clusters c hat k and c hat k prime. So if we know the form of that set, for example if we know that the set S is just an interval from, say, 3.1 to 96.2, then we can compute this p-value in literally one line of R code. So it seems like we should be done, but there's just one little question we need to sort out, which is: how do we compute the set S? How are we actually going to figure out the set of phi such that clustering x prime of phi gives us the same clusters we got on our data? And I just want to mention, this is actually where the work is in this project. It would be really tempting to just do a boneheaded thing where, for phi equals zero, we cluster the data and ask, okay, did that give us c hat k and c hat k prime? If yes, then phi equals zero goes in S, and if not, it doesn't. And then repeat for phi equals 10 to the negative 16, and then for phi equals two times 10 to the negative 16, and so on, all the way until you get to phi equals infinity.
So if we did that, if we had infinite computational resources, then we would be done right now. But of course, we do not have infinite computational resources, so we need an efficient way to do this. So how are we going to compute S? Well, the first thing we need to do is notice that there's a really elegant property of hierarchical clustering, and what that property says is that if the clusters that we get from hierarchical clustering are the same up until the l-th split, then the trees are actually going to be the same too. To see what I mean, let's say that we have some data, and when we cluster it hierarchically, here's what we get: for example, three and five cluster together, and two, one, and four cluster together. Well, in that case, if the clusters {3, 5} and {2, 1, 4} are identical, then everything under that black dashed line needs to be identical as well. So, for example, this is not a possibility: here we once again have one, two, and four together, and we once again have three and five together, but the things happening under the black dashed line are different on the left and on the right. And that's something that just can't happen; it's not possible.

Okay. So the next thing we're going to do is notice that the second line here can be rewritten. The set of phi such that clustering x prime of phi results in c hat k and c hat k prime: we just saw that this is also the set of phi for which the same pair of clusters merges at step l in x prime of phi and in x, for every value of l. But in particular, that is the set of phi such that the dissimilarity between any pair of clusters A and B in x prime of phi is at least as big as the dissimilarity between the clusters that actually merge at step l in x. And the reason for this is that if the dissimilarity between two clusters A and B in x prime of phi were less than the dissimilarity between the pair of clusters that merges at step l in x, then that would violate the result from the previous slide, which says that if clustering x prime of phi gives c hat k and c hat k prime, then everything below that pair of splits needs to be identical. So the details of this exact equation aren't really that important, but what I do want you to notice is that we're taking two intersections here, and all together this is an intersection over on the order of n cubed sets. And then here we have a set involving phi, and it turns out that for very popular types of hierarchical clustering, in particular for average linkage, centroid linkage, median linkage, and Ward linkage, this set is defined by a quadratic inequality in phi. So if you remember the quadratic formula from long ago, you can use the quadratic formula to find this set. That means that in order to characterize the set S for these very popular types of hierarchical clustering, we literally just need to solve on the order of n cubed quadratic equations, and then we're done. So we can compute the set S in O(n cubed) time for these very popular forms of hierarchical clustering. And it actually turns out that for single linkage hierarchical clustering, we can compute S even faster, in O(n squared) time. On the other hand, for complete linkage, we need to use a Monte Carlo approach to approximate the set S.
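(Once S is in hand, the final step really is close to a one-liner. Here is a small R sketch of that step, assuming that S has already been expressed as a union of intervals on the scale of the chi-squared statistic; computing S itself is the O(n cubed) part described above and is not shown. This follows the chi-squared framing used in the talk.)

```r
# Selective p-value as a truncated chi-squared tail probability.
# S: two-column matrix, each row an interval [lower, upper] of the truncation set.
# stat: the observed chi-squared statistic; df = q, the number of features.
truncated_chisq_pvalue <- function(stat, S, df) {
  mass_in_S <- sum(pchisq(S[, 2], df) - pchisq(S[, 1], df))       # P(chi^2_q in S)
  mass_tail <- sum(pchisq(pmax(S[, 1], stat), df, lower.tail = FALSE) -
                   pchisq(pmax(S[, 2], stat), df, lower.tail = FALSE))
  mass_tail / mass_in_S                                           # P(chi^2_q >= stat | chi^2_q in S)
}

# Hypothetical example: S = [0, 2] union [5, Inf), observed statistic 7, q = 2.
S <- rbind(c(0, 2), c(5, Inf))
truncated_chisq_pvalue(stat = 7, S = S, df = 2)
```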
And similarly, if we want to use other types of clustering that are not hierarchical, like k-means clustering, then right now the best thing we know how to do is a Monte Carlo approach to approximate S.

Okay, so let's just review where we've gotten. We wanted to test whether the mean of the k-th cluster equals the mean of the k-prime-th cluster, and we showed that in order to do this, we just need to calculate the CDF of a chi-squared random variable with q degrees of freedom, truncated to a particular set. The set was a little weird and hard to characterize, but it turns out that we can actually characterize it in O(n cubed) time for average, centroid, median, and Ward linkage, and we can compute it in O(n squared) time for single linkage. For complete linkage, or for k-means clustering, we don't know how to compute the set efficiently, but we can approximate the p-value using Monte Carlo.

Okay, so how does this actually work? Well, here are some results in simulation. I simulated data under a noise model, where there's no signal whatsoever, clustered to get three clusters, and then tested whether there's a difference in means for a randomly chosen pair of clusters. I know that there are no clusters here, so I want my p-values to be uniform. I computed my p-value using the exact characterization of the set S for average, centroid, and single linkage, and the Monte Carlo approach for complete linkage. And here's what I get. These are QQ plots, so I want all of the p-value curves that you see to lie on the 45-degree line, and indeed, that's exactly what's happening here. I've controlled type I error perfectly; the p-values are exactly uniform. This is in contrast to the QQ plot that I showed you at the beginning of this talk, where the p-values were extremely non-uniform because the type I error was massively inflated.

Okay. Well, now what if I simulate data where there actually are clusters? So now I've simulated data with 30 observations and three true clusters. There are actually 10 dimensions in my data, but here I'm just showing you a two-dimensional version so that I can display it on a slide. And I'm going to calculate two things. First, the conditional power: the probability that we reject the null hypothesis, given that c hat k and c hat k prime are true clusters. The reason I need to condition on that is that if c hat k and c hat k prime aren't actual clusters, then the null hypothesis would hold; in other words, I can only really talk about power if the null hypothesis does not hold, and that's why I'm conditioning on the event that c hat k and c hat k prime actually are clusters. Second, the detection probability: the probability that I correctly estimated the clusters c hat k and c hat k prime, in other words, the probability that those are real clusters. So here's what I see, for average, centroid, complete, and single linkage, in different colors, as a function of the signal size on the x-axis: conditional power on the left and detection probability on the right. I can see that the conditional power increases as the signal size increases, and so does the detection probability. The detection probability on the right doesn't really have to do with my double dipping p-values at all; it just tells me how well each of these clustering methods can correctly detect true clusters.
And anyone who's familiar with single linkage knows that it usually doesn't do that well, so you won't be surprised to see, on the right, that average, centroid, and complete linkage outperform single linkage. But my results are really the ones shown on the left, and what those say is that, out of the times that we correctly detected the clusters, our methods were able to reject the null hypothesis a whole lot, especially using average, centroid, and complete linkage. So again, what we're seeing is that the conditional power and the detection probability increase as the difference in means between the true clusters increases. This is really reassuring, because it means that not only do we control the type I error, but we also have good power. And we also see that which linkage you use matters. It's well known, on the right-hand side, that the detection probability is typically higher for average, centroid, and complete linkage than for single linkage, but we can see on the left that the conditional power is also highest for average and complete linkage.

Okay, so finally, I just want to talk briefly about two applications to real data sets. The first application has to do with Palmer penguins. This is a really cool new data set, by Allison Horst; you can find the citation in the corner of the slide. The idea here is that we have 165 female penguins for whom we know their species (there are three species), and we also know their bill length and their flipper length. So the first thing we do is apply average linkage hierarchical clustering to get six clusters. Here the estimated clusters are shown in different colors, and the different species are shown with different shapes. The reason I'm asking for six clusters here is that I wanted to show you, as a proof of concept, that if we erroneously estimate clusters that aren't real, then we should fail to reject the null hypothesis. In particular, if you look at these green and orange clusters here that I've circled, they're mostly circles, with just a few triangles. So these clusters that I've estimated are mostly just Adelie penguins. And so we can wonder: if I test the null hypothesis of no difference in means between these two clusters, will I reject it? The answer is that if I used a naive p-value that didn't account for double dipping, I would reject the null hypothesis. But our p-value, the focus of my talk today, does not reject the null: we get a p-value of 0.87. And that makes sense, because indeed, there is no true difference between Adelie penguins and Adelie penguins; the green and orange clusters are both actually part of the same true cluster. So we were able to figure out that there's not a true difference in means using our p-value, but not using the naive Wald p-value.

By contrast, what if I compare the green and the pink clusters? Well, you can notice, first of all, that the green cluster is mostly circles and the pink cluster is entirely squares, so these really do correspond to two different species. If I compute the Wald p-value, which doesn't account for double dipping, I get a p-value of 10 to the negative 16, and our p-value is 10 to the negative 15. So either way, we're able to really resoundingly reject the null hypothesis that these clusters have the same mean, but we're able to do it in a way where we don't also falsely reject when there's no true difference in means.

And so finally, the last thing that I want to talk about is an application to some single-cell RNA sequencing data.
This figure was made by Stephanie Hicks. The idea behind single-cell RNA sequencing is that we're able to isolate and sequence individual cells from a tissue sample; we actually get reads for the amount of RNA transcript from individual cells, and then the goal is to compare the gene expression profiles of these individual cells. The data that we used is from the paper "Massively parallel digital transcriptional profiling of single cells." And basically, here's what we did. First, we came up with a negative control data set: we just took 600 T cells. As far as we know, there are no true clusters in these data; they're all T cells. Then we also took 200 B cells, 200 T cells, and 200 monocytes, and we refer to this as a positive control data set, because here we have three different cell types: B cells, T cells, and monocytes. So what we would like to see is that in the negative control data set, if we cluster those T cells, we do not reject the null hypothesis, and in the positive control data set, with B cells, T cells, and monocytes, we do reject the null hypothesis.

Very briefly: of course, analyzing these data required pre-processing, so we applied a standard pre-processing pipeline. And maybe the most important thing to mention is that the way I've presented our test for a difference in means across clusters assumed that the features were independent. For RNA sequencing data, as for a lot of data from biology, that assumption of independence is clearly not true, because genes are highly correlated with each other in terms of their expression. So we applied a slight modification of the test that I've talked about today in order to allow for a non-diagonal covariance matrix among the genes.

Okay. So first, remember, we clustered all of the cells of a single cell type. We applied hierarchical clustering to get three clusters, and here I'm just showing you the projection of the data onto the first two principal components, which is why it might look a little bit weird that my clusters appear to overlap. It's because I clustered the data in the original high-dimensional space, and then, after clustering, projected it onto two principal components. After I do this, I can compute the p-values comparing the means of the green and red, green and blue, and red and blue clusters. When I compute the naive Wald p-values, all of those p-values are less than 0.01. But by contrast, the p-values for our test are quite modest, around 0.7 or 0.8, so we do not reject the null hypothesis that there's no difference in means between the estimated clusters. And indeed, we're right about that, since this was our, quote unquote, negative control data set, where all of the cells were of the same type.

And by contrast, here we've clustered the three cell types (remember, 200 B cells, 200 memory T cells, and 200 monocytes) and estimated three clusters. Now the p-values that do not account for double dipping, using the naive Wald approach, are all around 0.01, and our p-values are very similar. So here you can see that our p-values are almost identical to the Wald test p-values: we've lost very little, really we've lost nothing, by accounting for double dipping properly in our analysis.

Okay, so I just want to close with a few last comments.
So for the last comments, the point that I really want to bring home is that double dipping is pervasive. And it's very easy for statisticians to tell people not to double dip. If you read a textbook, it'll say, you know, once you've done an exploratory analysis, you have to be really careful about your hypothesis testing, and in fact, you probably can't do a hypothesis test after clustering. But it's not super helpful for a statistician to tell someone that they can't do this, because the issue is that people need to use their data to generate hypotheses. That is what the data is for. That's why they collected it, and that's how science is being done. So instead of telling people what not to do, we should provide people with methods that they can use in order to do these analyses safely.

I think there's a clear analogy here to harm reduction in public health. For example, it used to be that people thought the way to prevent unwanted pregnancies or sexually transmitted infections was through abstinence-only education, because indeed, the only 100% foolproof way to never have an unwanted pregnancy is to practice abstinence. But it turns out that abstinence-only education isn't really that helpful, because it doesn't give people a realistic way to safely do the things that they want to do, and that's why safe sex education is now preferred. Similarly, we know that, from an individual health perspective, it's obviously better not to inject drugs; however, instead of shaming people who inject drugs, it's more effective to provide safe needle exchanges. And finally, in the context of COVID-19, the absolutely foolproof way to eliminate COVID-19 from the population would be a complete lockdown, but it's hard to get people to comply with a complete lockdown, so instead, what public health experts want to do is teach people how to socialize safely, in a way that reduces the risk of transmission. So the idea here is that I view the work I've talked about today not as harm reduction in public health, but as harm reduction in statistics. There are types of analyses that people need to do; they need to use their data for hypothesis generation. So how can we, as statisticians, help them figure out ways to get what they need out of their data, while making the way they do it as safe as possible?

I just want to close by thanking my collaborators, to whom I'm truly indebted for their incredible contributions to the work I've talked about today. I briefly mentioned the idea of double dipping for change point detection; that work was led by my PhD student Sean Jewell, who graduated recently, and is now being carried on by a current PhD student in collaboration with Paul Fearnhead at Lancaster University. I briefly mentioned double dipping for decision trees; that work is being led by my student Anna Neufeld in collaboration with my recent PhD alum Lucy Gao. Lucy has actually just started a faculty position at the University of Waterloo in Canada. And finally, the focus of my talk today was double dipping in the context of hierarchical clustering, and that work was led by Lucy in collaboration with our colleague Jacob Bien at the University of Southern California. So I want to thank all of you for giving me the opportunity to talk to you today, and I'm happy to take any questions.

Hi, this is Beth.
And there are a number of questions that have come in, one of them being whether there is software available to help with some of this. Yeah, that's such a great question, and of course, this is an R Medicine conference. So for our change point detection work, double dipping for change point detection, there is software. I'm sorry that I didn't include the URL, but it's available through my student Sean Jewell's GitHub page; he's the one who led that work. It's actually some really nice software, and of course it's written in R, as it should be. For the work on hierarchical clustering, we're wrapping that up now; there will certainly be an R package available as soon as we finish writing the paper, and the preprint will be available on arXiv pretty soon. And similarly, for our work on decision trees, we're wrapping up the manuscript, and both an R package and a preprint will be available as soon as that's done.

Great. So one of the other questions was, I think, that people are still struggling with this idea that splitting the sample doesn't fix the problem, and they're trying to get some intuition for that. Right. So it's crazy, right? Can you still see the screen? Great. Yeah. So it's crazy, because it seems like, how can it possibly not solve the problem? And I really think the best way to see it is through this example in the context of clustering. Again, we started out with data that's just noise; there's no signal here. We split it into a training set and a test set, and we did everything exactly right: we clustered the training set, and we applied those cluster labels to the test set using, in this case, three nearest neighbors, though it doesn't really matter. And there's still a difference in means between the clusters on the test set. And it's like, how can that be? But actually, it makes sense, because if you think about how we assigned those observations in the test set, we basically said: okay, if you're over here at the top of the slide, I'm going to call you orange, and if you're over here at the bottom of the slide, I'm going to call you green. So necessarily, there is a difference between the means of those orange and green clusters. It's crazy, because we always think it should work; I'm a big fan of cross-validation and the validation set approach, and I think they can often get us out of a lot of trouble, but they aren't always going to rescue us. In the case of clustering, it's mind-blowing to me, but sample splitting does not rescue us.

And then in the other examples that I gave, like decision trees and change point detection, you can make some headway using sample splitting. For example, for a decision tree, if you split your data into a training set and a test set, fit the decision tree on the training set, and then apply that decision tree to the test set, that can sort of help you; it's not quite as egregious as it is here with clustering. But then you end up with a different problem, where you've only used half of your data to estimate that decision tree. So you've tested a null hypothesis about a tree that's not actually the tree that you care about. But in the context of clustering, the sample splitting approach just does not work. And in other cases, sample splitting is kind of okay, but it's just not great, because you reduce the amount of data that you have.
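(Since this question comes up a lot, here is a rough R sketch of the sample splitting experiment described above: noise-only data, complete-linkage clustering of the training half into two clusters, 3-nearest-neighbor labeling of the test half, and a naive known-variance test on the test half. The specific linkage and test are my illustrative assumptions; the point is that the p-value on the test set is typically still tiny even though there is no signal.)

```r
library(class)                                 # for knn()

set.seed(1)
X <- matrix(rnorm(100 * 2), ncol = 2)          # noise only: no true clusters
train <- sample(1:100, 50)                      # split into training and test halves
test  <- setdiff(1:100, train)

hc <- hclust(dist(X[train, ]), method = "complete")
train_labels <- factor(cutree(hc, k = 2))       # cluster only the training set

# Label the test set via 3-nearest-neighbors from the training clusters,
# then naively test for a difference in means on the test set (sigma^2 = 1 known).
test_labels <- knn(X[train, ], X[test, ], cl = train_labels, k = 3)
Xt <- X[test, ]
m1 <- colMeans(Xt[test_labels == "1", ]); m2 <- colMeans(Xt[test_labels == "2", ])
stat <- sum((m1 - m2)^2) / (1 / sum(test_labels == "1") + 1 / sum(test_labels == "2"))
pchisq(stat, df = 2, lower.tail = FALSE)        # typically tiny despite no signal
```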
So does this then expand into machine-learning-type tools, in terms of having the same type of issue with, you know, thinking about having that training set and the test set? Yeah, I mean, I don't have a problem with using a training set and a test set. First of all, training sets and test sets are great; cross-validation is great; the validation set approach is great. These are all really useful tools. But I'm thinking about this from the perspective of a researcher who has generated all of this single-cell RNA-seq data about some particular condition or tissue or whatever it is they're studying, and they want to know: are there subgroups? And then they want to see whether they can validate that those subgroups are really there. And it's a bummer for that researcher to have to cluster only half of the cells in order to get p-values using the other half of the cells. That would be a bummer, but also, it literally wouldn't work, for the reason that I just showed: sample splitting doesn't actually solve this problem for clustering. So I think there are a lot of settings in which sample splitting is a really great strategy; it just absolutely does not work in this particular context of clustering. And even in cases where it does work, like decision trees, it would be a bummer to throw away half of your observations, and we would just need to think pretty hard about that, I think.

And in a lot of situations we have small n. That's right. Yeah, in a lot of situations we could have limited observations. But also, conceptually, it would be too bad if you test a null hypothesis having to do with how good the clusters are that you estimated, or the change points you detected, but you actually only detected those change points on half the data; if you'd had more data, maybe you would have gotten something different. Right, exactly.

Right. Are there other questions that people would like to bring forth? Either put them in the chat or in the question section; I'm kind of looking through the questions. Yeah, and some of the questions were, you know, is it because you're violating modeling assumptions, or other things? Yeah, I see a question from R.C. Aurora: was this done on all features of X distributed normal with mean mu and some standard deviation, or just the x and y axes shown? And I believe that that's still a question about the sample splitting slide. All of the data is shown here; there's no additional data. I literally simulated 100 observations from a Normal(0, 1) model in two dimensions, and it's all shown here. There's nothing hidden.

Okay. Yeah. Well, this has been really informative. I know I've probably been guilty of this in the past, so it's given me a lot to think about, and I want to go read more of your work; I look forward to your publications. So yeah, we are going to have a break now for 15 minutes. If there are other people or other questions, I'll stick around for a little bit, but otherwise, the next talk will begin at the half hour, for whatever time zone you're in. All right. Thanks so much. Thanks to everyone for attending my talk.