Thanks, Yuri. Thanks to Jean and to Flohan and everyone at ICTP for making this workshop possible. Thanks also to all of the panelists for really interesting talks, and to all of our attendees joining from around the world.

My talk is going to be a bit more statistics-flavored than many of the talks we've seen over the past day and a half: I'm going to talk about doing statistical inference with adaptively collected data.

I want to start by showing you a very simple experiment, based on a data set from a paper in the New England Journal of Medicine. What the authors did, nearly twelve years ago, was what we would now call a Kaggle-style machine learning challenge. They collected a large data set of patients who were prescribed warfarin. Warfarin is a very popular blood thinner, particularly in North America, and it has high variance in its therapeutic dose: some people require very little of it, and some require a lot to get the therapeutic benefit. The right dose is difficult to compute, and nurses and doctors spend a lot of time trying to find it for each patient. The authors tried to come up with a simple model for this prediction task. There is a large set of patients, about 5,000, and for each one the stable therapeutic dose of warfarin was computed; these are the y's. They then fit a function to predict the dose from a set of features, or covariates, about these patients, and it turned out that a linear model was probably the best fit.

Here is a thought experiment, which we'll see simulation results about: what would have happened had this data been collected in batches, or collected adaptively? Imagine the patients didn't all come in at the same time. One batch of patients came in and went through the protocol, and we found the right dose for them, but perhaps one patient, the red patient on the slide, didn't do well on the protocol. In the next batch, when you saw a patient who looked like the red patient, you replaced them with someone else, put that person through the protocol, and so on. So instead of the data set being collected agnostically, imagine this is what happened, which is not completely unreasonable.

Two questions arise. First, perhaps the model is still right, but does this affect statistical estimation: do we still get consistent estimates of theta_0, and what are the error rates? Second, and more importantly, does this affect statistical inference: can we compute confidence intervals, p-values, and all of the usual things in the same way?

This data set is available online, so you can download it and experiment, and that's exactly what I did.
I'm going to sample about 10% of this data set of patients, so 500 patients, in two different ways. In the first way, I mimic adaptive data collection. I sample half of my subsample, about 250 data points, uniformly at random, and compute an intermediate estimate theta_hat based on it; say theta_hat is just the least squares estimate on these 250. Then I sample the second half of 250 from the top 15% of predicted doses: from the rest of the population, I pick only patients with a high predicted dose under the intermediate theta_hat. You can think of this as a way of weeding out patients who might do worse on the protocol you have. We'll compare this with what we think of as ideal, statistically valid data collection, which is purely random: instead of this two-stage procedure, just sample the whole set of 500 uniformly at random.

If you compute, say, the least squares estimate on the whole subsample in both scenarios, you get histograms of errors like the ones on the slide; the black dashed line shows the true effect in both cases. Under adaptive data collection, with the two-stage procedure, the histogram of errors on the subsample is shifted off of the true effect, whereas for the random subsample it is centered right at the true effect. I'll make two observations about this. First, the adaptive data collection doesn't seem to affect the size of the estimator error very much: the widths of the two roughly Gaussian histograms are about the same. But the estimator can be biased, shifted to the left, so the shape of the distribution changes.

My goal for today is to tell you about a very simple model of adaptive data collection that captures these issues well and makes them very clear. I'll tell you what we know from prior theory and exactly where earlier theoretical work fails; I'll tell you some ways we've thought of to remedy this; and I'll end with some concluding thoughts on how this connects with other things we know.

The model is going to be very simple: a linear model. There is a parameter theta_0 that we wish to infer from data, and a sample of size n where y_i is linearly related to the covariates x_i through theta_0, with Gaussian noise. But the data is collected in two batches. In the first batch, the data points x_1 through x_m are chosen i.i.d. from some distribution P_X; based on these x's you observe the corresponding y's and compute an intermediate estimate theta_hat from them. For concreteness, say theta_hat is the least squares estimate based on the first m data points. The remaining data points, m+1 through n, are biased samples of x: you can think of importance sampling x so that the biased sample has at least some correlation with this intermediate estimate theta_hat computed on the first batch. So there is just one point of adaptivity, the point at which you compute the intermediate estimate, which happens after m data points.
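Since I can't assume the warfarin data is at hand here, the following is a minimal sketch in Python of the two-stage subsampling experiment described above, run on a synthetic stand-in population. The population size, dimension, noise level, and random seeds are my own illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the warfarin cohort: a fixed population of
# n_pop patients with p covariates and a dose linear in the covariates.
n_pop, p = 5000, 10
theta0 = rng.normal(size=p)
X_pop = rng.normal(size=(n_pop, p))
y_pop = X_pop @ theta0 + rng.normal(size=n_pop)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def adaptive_subsample(m=250, k=250, top=0.15):
    # Stage 1: m patients uniformly at random, plus an intermediate fit.
    idx1 = rng.choice(n_pop, size=m, replace=False)
    theta_int = ols(X_pop[idx1], y_pop[idx1])
    # Stage 2: from the remaining patients, keep only the top 15% by
    # predicted dose under the intermediate estimate, and sample k of them.
    rest = np.setdiff1d(np.arange(n_pop), idx1)
    pred = X_pop[rest] @ theta_int
    pool = rest[pred >= np.quantile(pred, 1 - top)]
    idx2 = rng.choice(pool, size=k, replace=False)
    idx = np.concatenate([idx1, idx2])
    return ols(X_pop[idx], y_pop[idx])

def random_subsample(n=500):
    idx = rng.choice(n_pop, size=n, replace=False)
    return ols(X_pop[idx], y_pop[idx])

# Error of the first coordinate across replications. The adaptive mean
# should drift away from zero while the random one stays centered, with
# roughly equal spreads, mirroring the two histograms on the slide.
err_a = [adaptive_subsample()[0] - theta0[0] for _ in range(1000)]
err_r = [random_subsample()[0] - theta0[0] for _ in range(1000)]
print(f"adaptive: mean {np.mean(err_a):+.4f}, sd {np.std(err_a):.4f}")
print(f"random:   mean {np.mean(err_r):+.4f}, sd {np.std(err_r):.4f}")
```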
There are a couple of reasons to focus on this model. One is that it's very simple, but it still approximates well-known bandit algorithms and the adaptive clinical designs that people use in medicine, so it's not a bad model to begin with. More importantly, I think it captures the key issue with adaptively collected data, which is that the future data points we get to see depend on past outcomes. As we'll see, this is the main problem.

So what does prior theory say? There's a lot of work on this, but I'll summarize it in two points. The first concerns consistency: if you compute the least squares estimate on the full data set, then under fairly mild conditions it has good error properties; you get essentially the dimension-over-n error rate that you'd expect. But the distributional properties of this estimate are not what you'd expect. Although it is essentially an average of data points, because the data collection is itself biased, it need not be the case that theta_hat is unbiased and Gaussian as you'd expect. You need an additional condition, which is what Lai and Wei called stability. What this means is that the sample covariance is essentially deterministic: it concentrates well around a deterministic quantity. In the case that the data collection procedure happens to be stable, then in addition to the consistency properties, theta_hat is also Gaussian with the covariance that you expect. (Audience question: so there is no log n? No, the log n is still there; it's just hidden within the sample covariance.) So the key point is that the error rates are relatively robust to adaptive data collection, but the distributional theory is not. And if you want to compute p-values or confidence intervals, this is a problem.

Here is a first, relatively simple observation connecting this to the topic of our workshop: in some sense, the high dimensionality of the data set is related to this lack of stability. In a simple counterexample where P_X is standard normal, you can prove that the bias of the least squares estimate is essentially proportional to the true theta_0, with a proportionality constant of order p over n. So when is the bias negligible? The bias is negligible when p over n is smaller than the standard deviation, which is of order one over the square root of n, and if you do the calculation, this happens only if n is much bigger than p squared. But if we believe the error rates from before, consistency only needs n at least of order p log p. So a gap appears as soon as the dimension is reasonably comparable with n. This is the key reason why you see the bias appearing in the simulations we saw earlier.

Now I'm going to go over a simple solution to this, and to motivate it, I'll go through one step, which we call predictable estimators. Let's go back to dimension one: instead of being high-dimensional, we want to estimate theta_0, which is now just a scalar. We started with linear estimates, so why should we change? Let's still compute a linear estimate.
For any set of weights w_i applied to the outcomes, you can use the model equation to write theta_hat as the truth, plus something proportional to the truth, plus something proportional to the noise. This is true for any set of weights w; it's just an algebraic identity, nothing smart is happening. But let's try to isolate the bias, and in particular make sure that the bias sits only in the first term. If the w_i are constructed online, so that w_i is a function only of the data seen up to time i, then by construction epsilon_i is independent of w_i, and the noise term behaves basically like a random walk. That means whatever bias theta_hat has must concentrate in the first term: the bias is whatever this choice of w_i puts in the first term, and the variance is given by the second. Of course this is a simple idea. The key point is that we still haven't chosen the w_i; we've only come up with one desideratum, namely that they must be chosen online, respecting the time ordering of the data.

Debiased estimators are just one step beyond this: a small twist where you apply the same construction to the error rather than to the estimate itself. It's not crucial, and the calculations are much the same, so I'll skip them. You can think of many ways of constructing the weights subject to the desideratum I mentioned. What we do in the paper is simply optimize the bias-variance trade-off: given all of the weights chosen up to time i minus one, choose the i-th weight to trade off the bias and the variance. That's the equation in dimension one; in higher dimensions it's not particularly hard to generalize. A sketch of this weight construction appears below.

The interesting thing, I think, is the kind of theorem you can prove: under mild conditions, essentially just a consistency criterion on the pilot estimate, the debiased estimate is Gaussian with variance of the order you'd expect. The important thing is that the stability condition required of the data collection process is built into the debiased estimator, so you don't need it as an additional assumption; all you need is that the original pilot estimate is stable. You also need something to decide the regularization, but I'll skip that in the interest of time.

Let's go back and see whether this actually works. Recall the histogram of errors for the standard least squares estimate; if we do the debiasing, we get the blue histograms on top. As you can see, this automatically corrects the bias, at the price of a little bit of variance, which one can show is also necessary, but I'll skip that, and I'll skip the normalized errors as well.

To conclude, I'll make a few points. We've been thinking about online debiasing as a robust and flexible wrapper for inference. I presented it in the context of least squares estimation, but we have a paper that does this with the lasso, so multiple things can be done, and we're also working on nonlinear models. The key insight, or the key difficulty, is to understand what data we get to see: you have to model the population the data comes from, and it is not simply a fixed population sampled i.i.d. That's why the issue crops up.
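To make the one-dimensional weight construction above concrete, here is a minimal sketch. The specific one-step objective and the tuning constant lam are my own illustrative choices, not the exact rule from the paper: at step i, with b denoting the remaining bias budget 1 minus the sum of w_j x_j over j < i, I pick w_i to minimize lam * (b - w_i * x_i)^2 + w_i^2, which has a closed form and only uses data seen up to time i.

```python
import numpy as np

def online_debias_1d(x, y, lam=1.0):
    """Linear estimate sum_i w_i y_i with predictable (online) weights.

    Identity: theta_hat = theta0 * sum_i w_i x_i + sum_i w_i eps_i, so we
    want sum_i w_i x_i close to 1 (small bias) while keeping sum_i w_i^2
    small (small variance). Each w_i depends only on x_1..x_i, never on
    y_i, so the noise term is a martingale.
    """
    b = 1.0                      # remaining bias budget: 1 - sum_{j<i} w_j x_j
    theta_hat, wsq = 0.0, 0.0
    for xi, yi in zip(x, y):
        # Minimize lam*(b - w*xi)^2 + w^2 over w: a one-step
        # bias-variance trade-off with closed-form solution.
        w = lam * b * xi / (1.0 + lam * xi**2)
        theta_hat += w * yi
        wsq += w**2              # proxy for the variance of the noise term
        b -= w * xi
    return theta_hat, b, wsq     # residual bias is theta0 * b

# Tiny demo on adaptively collected scalar data: the second batch of x's
# is tilted by a pilot estimate computed from the first batch.
rng = np.random.default_rng(1)
theta0, m, n = 1.0, 100, 200
x1 = rng.normal(size=m)
y1 = theta0 * x1 + rng.normal(size=m)
theta_pilot = (x1 @ y1) / (x1 @ x1)                 # pilot least squares fit
x2 = rng.normal(loc=0.5 * theta_pilot, size=n - m)  # biased second batch
y2 = theta0 * x2 + rng.normal(size=n - m)
est, resid, wsq = online_debias_1d(np.r_[x1, x2], np.r_[y1, y2])
print(f"estimate {est:.3f}, residual bias factor {resid:.3f}")
```

Larger lam pushes the residual bias factor toward zero faster at the cost of more unequal weights, which is exactly the bias-variance dial the talk describes.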
This is also something that appears in a lot of other related problems in bandits, reinforcement learning, and causal inference. For myself, one of the important learnings was that modeling data provenance is crucial for providing valid inferential guarantees, and I hope it's something that we in the theory community pay more attention to as time goes on. Thank you.